Abstract

In a regression setup with deterministic design, we study the pure aggregation problem
and introduce a natural extension from Gaussian to distributions in the exponential family.
While this extension bears strong connections with generalized linear models, it does not 
require identifiability of the parameter or even that the model on the systematic component 
is true. It is shown that this problem can be solved by constrained and/or penalized likelihood 
maximization and we derive sharp oracle inequalities that hold both in expectation and with 
high probability. A new proof technique that exploits the structure of the loss function is 
employed. It yields error bounds that are accurate already for small sample sizes. Finally,
all the bounds are proved to be optimal in a minimax sense.

Mathematics Subject Classifications: Primary 62G08, Secondary 62J12, 68T05, 62F11. 
Key Words: Aggregation, Regression, Classification, Oracle inequalities, finite sample bounds, 
Generalized linear models, logistic regression, minimax lower bounds. 
Short title: Kullback-Leibler aggregation. 

1 Introduction 

The last decade has witnessed a growing interest in the general problem of aggregation, which 
turned out to be a flexible way to capture many statistical learning setups. Originally introduced 
in the regression framework by Nemirovski (2000) and Juditsky and Nemirovski (2000) as an 
extension of the problem of model selection, aggregation became a mature statistical field with 
the papers of Tsybakov (2003) and Yang (2004) where optimal rates of aggregation were derived. 
Subsequent applications to density estimation (Rigollet and Tsybakov, 2007) and classification 
(Belomestny and Spokoiny, 2007) constitute other illustrations of the generality and versatility 
of aggregation methods. In the pure aggregation setup, it is assumed that a collection of functions 
is given and the goal is to find a linear combination of these functions that exhibits a small risk. 

The general problem of aggregation can be described as follows. Consider a finite family $\mathcal{H}$
(hereafter called dictionary) of candidates for a certain statistical task. Assume also that the
dictionary $\mathcal{H}$ belongs to a certain linear space so that linear combinations of functions in $\mathcal{H}$
remain plausible candidates. For example, such candidates can be estimators constructed from
a hold-out sample or simply basis functions or any system of functions with good approximation
properties. Indeed, the theory of aggregation is developed under minimal conditions on the
dictionary. Building on original results regarding model selection for density estimation (Yang,
2000) and regression (Catoni, 2004), Nemirovski (2000) identified two new types of aggregation:
convex aggregation, where the goal is to mimic the best convex combination of candidates in $\mathcal{H}$,
and linear aggregation, where the goal is to mimic the best linear combination of candidates
in $\mathcal{H}$.

* Princeton University.

One salient feature of aggregation as opposed to standard statistical modeling, is that it 
does not rely on an underlying model. Indeed, the goal is not to estimate the parameters of an 
underlying 'true' model but rather to construct an estimator that mimics the performance of 
the best model in a given class, whether this model is true or not. From a statistical analysis 
standpoint, this difference is significant since performance cannot be measured in terms of
parameters: there is no true parameter. Rather, a stochastic optimization point of view is adopted
in aggregation. If $\mathcal{C}$ denotes a class of linear combinations of functions in $\mathcal{H}$ and $R(\cdot)$ denotes a
convex risk function, the goal pursued in aggregation is to construct an aggregate estimator $\hat h$,
measurable with respect to the data at hand and such that

$$\mathbb{E} R(\hat h) \le C \min_{f \in \mathcal{C}} R(f) + \epsilon,$$

where $\epsilon$ is a small term that characterizes the performance of the given aggregate $\hat h$. As illustrated
below, the remainder term $\epsilon$ is an explicit function of $M$ and $n$ that shows the interplay between
these two fundamental parameters. Such oracle inequalities with optimal remainder term $\epsilon$ were
originally derived by Yang (2000) and Catoni (2004) for model selection in the problems of
density estimation and Gaussian regression respectively. The method that they used, called
progressive mixture, was later extended to more general stochastic optimization problems in
Juditsky et al. (2008). However, only bounds in expectation have been derived for this estimator
and it is argued in Audibert (2007) that this estimator cannot achieve optimal remainder terms 
with high probability. One contribution of this paper (Theorem 4.2) is to develop a new estimator 
that enjoys this desirable property. 

When the model is misspecified, the minimum risk satisfies $\min_{f \in \mathcal{C}} R(f) > 0$, and it is
therefore important to obtain a leading constant $C = 1$. Many oracle inequalities with leading
constant term $C > 1$ can be found in the literature for related problems. Oracle inequalities
in Yang (2004) also exhibit a constant $C > 1$ but in that paper, the class $\mathcal{C} = \mathcal{C}_n$ actually
depends on the sample size $n$ so that $\min_{f \in \mathcal{C}_n} R(f)$ goes to 0 as $n$ goes to infinity under additional
regularity assumptions. In this paper, we focus on the pure aggregation setup as defined by
Nemirovski (2000) and Tsybakov (2003) where the class $\mathcal{C}$ is fixed and remains very general.
As a result, oracle inequalities considered here only have leading constant $C = 1$. Because they
hold for finite $M$ and $n$, such oracle inequalities are truly finite sample results.

Recall that the setup of Tsybakov (2003) is the following. The observations consist of $n$ i.i.d.
copies of the random couple $(X, Y)$ where $X$ has known marginal distribution and $Y = f(X) + \xi$,
where $\xi$ is a centered random variable with bounded variance that is independent of $X$.
Furthermore, in the case of convex aggregation, it is assumed that $\xi$ has a normal distribution
$\mathcal{N}(0, \sigma^2)$.

We consider an extension of aggregation for Gaussian regression that encompasses distributions
for responses in a one-parameter exponential family, with particular focus on the family
of Bernoulli distributions in order to cover binary classification. A natural measure of risk in 
this problem is related to the Kullback-Leibler divergence between the distribution of the actual 
observations and that of observations generated from a given model. In a way, this extension is 
close to generalized linear models (see, e.g., McCullagh and Nelder, 1989), which are optimally
solved by maximum likelihood estimation (see, e.g., Fahrmeir and Kaufmann, 1985). However, 
in the present aggregation framework, it is not assumed that there is one true model but we 
prove that maximum likelihood estimators still perform almost as well as the optimal solution 
of a suitable stochastic optimization problem. This generalized framework encompasses logistic 
regression as a particular case. 

Throughout the paper, for any $x \in \mathbb{R}^M$, let $x_j$ denote its $j$-th coordinate. In other words, any
vector $x \in \mathbb{R}^M$ can be written $x = (x_1, \dots, x_M)$. Similarly, an $n \times M$ matrix $H$ has coordinates
$H_{i,j}$, $1 \le i \le n$, $1 \le j \le M$. The derivative of a function $b : \mathbb{R} \to \mathbb{R}$ is denoted by $b'$. Finally, for
any real-valued function $f$, we denote by $\|f\|_\infty = \sup_x |f(x)| \in [0, \infty]$ its sup norm.

The paper is organized as follows. In Section 2, we recall a few important results about 
generalized linear models and some of their extensions. Then, in Section 3, we define the 
problem of Kullback-Leibler aggregation, which is the counterpart of generalized linear models 
but in an aggregation framework where the model may be misspecified. In particular, we exhibit
a natural measure of performance that suggests the use of constrained likelihood maximization 
to solve it. Exact oracle inequalities, both in expectation and with high probability are gathered 
in Section 4 and their optimality for finite M and n is assessed in Section 5. These oracle 
inequalities for the case of large M are illustrated on a logistic regression problem, similar to the 
problem of training a boosting algorithm, in Section 6. Finally, Section 7 contains the proofs of 
the main results together with useful properties on the concentration and the moments of sums 
of random variables with distribution in an exponential family. 

2 Generalized linear regression 

2.1 Setup and notation 

Let $Y \in \mathcal{Y} \subset \mathbb{R}$ be a random variable and let $\bar{\mathcal{Y}}$ denote the convex hull of $\mathcal{Y}$. Let $\mathcal{X}$ be an
abstract space and define the regression function $f : \mathcal{X} \to \bar{\mathcal{Y}}$ of $Y$ onto $X$ by $f(x) = \mathbb{E}[Y | X = x]$.
The function $f$ is unknown and we observe a sample $(x_1, Y_1), \dots, (x_n, Y_n)$ where the $x_i \in \mathcal{X}$,
$i = 1, \dots, n$, are deterministic and the $Y_i \in \mathcal{Y}$, $i = 1, \dots, n$, are independent random variables such
that $\mathbb{E}[Y_i | x_i] = f(x_i)$.

Consider the equivalence relation $\sim$ on the space of functions $f : \mathcal{X} \to \mathbb{R}$ that is defined
such that $f \sim g$ if and only if $f(x_i) = g(x_i)$ for all $i = 1, \dots, n$. Denote by $Q_{1:n}$ the quotient
space associated to this equivalence relation and define the norm $\|\cdot\|$ by

$$\|f\|^2 = \frac{1}{n} \sum_{i=1}^n f^2(x_i).$$

Note that $\|\cdot\|$ is a norm on the quotient space but only a seminorm on the whole space of
functions $f : \mathcal{X} \to \mathbb{R}$. In what follows, it will be useful to define the inner product associated
to $\|\cdot\|$ by

$$\langle f, g \rangle = \frac{1}{n} \sum_{i=1}^n f(x_i) g(x_i).$$

Using this inner product, we can also denote the average of a function $f$ by

$$\langle f, \mathbf{1} \rangle = \frac{1}{n} \sum_{i=1}^n f(x_i),$$

where $\mathbf{1}(\cdot)$ is the function in $Q_{1:n}$ that is identically equal to 1.

2.2 Generalized linear models

Recall that a random variable $Y \in \mathbb{R}$ has distribution in a (one-parameter) canonical exponential
family if it admits a density with respect to a reference measure on $\mathbb{R}$ given by

$$p(y; \theta) = \exp\Big\{ \frac{y\theta - b(\theta)}{a} + c(y) \Big\}. \tag{2.1}$$

The parameter $\theta \in \Theta \subset \mathbb{R}$ is called canonical parameter and $a > 0$, $b(\cdot)$ and $c(\cdot)$ are given. In the
rest of this paper, we only consider functions $b(\cdot)$ that are twice continuously differentiable. When
the reference measure is the Lebesgue measure on $\mathbb{R}$, exponential families encompass Gaussian
or Gamma distributions. Discrete distributions that admit such a density with respect to the
counting measure on the set of integers $\mathbb{Z}$ include Poisson and Bernoulli. A detailed treatment of
exponential families of distributions together with examples can be found in Barndorff-Nielsen
(1978); Brown (1986); McCullagh and Nelder (1989) and in Lehmann and Casella (1998). Several
examples are also treated in Section 6 of the present paper. It can be easily shown that if
$Y$ admits a density given by (2.1), then

$$\mathbb{E}[Y] = b'(\theta), \qquad \mathrm{var}[Y] = a\, b''(\theta). \tag{2.2}$$

We assume hereafter that the distribution of $Y$ is not degenerate so that (2.2) ensures that $b$ is
strictly convex and $b'$ is a bijection onto its image space.
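The identities in (2.2) can be checked numerically on a concrete family. The sketch below is only an illustration and makes its own choices: it takes the Bernoulli family in canonical form, for which $a = 1$ and $b(\theta) = \log(1 + e^\theta)$, and compares finite-difference derivatives of $b$ with the usual Bernoulli mean $p$ and variance $p(1-p)$. The helper names and the value of `theta` are hypothetical.

```python
import math

def bernoulli_b(theta):
    # Cumulant function of the Bernoulli family in canonical form (a = 1).
    return math.log1p(math.exp(theta))

def first_derivative(fn, x, h=1e-4):
    # Central finite difference approximation of fn'(x).
    return (fn(x + h) - fn(x - h)) / (2.0 * h)

def second_derivative(fn, x, h=1e-4):
    # Central finite difference approximation of fn''(x).
    return (fn(x + h) - 2.0 * fn(x) + fn(x - h)) / (h * h)

theta = 0.7                       # arbitrary canonical parameter (illustrative)
a = 1.0                           # dispersion constant for Bernoulli
p = 1.0 / (1.0 + math.exp(-theta))  # success probability matching theta

mean_from_b = first_derivative(bernoulli_b, theta)       # should match p
var_from_b = a * second_derivative(bernoulli_b, theta)   # should match p(1-p)

print(mean_from_b, p)
print(var_from_b, p * (1.0 - p))
```

Since $b''(\theta) = p(1-p) > 0$ for every finite $\theta$, the same computation also illustrates why $b$ is strictly convex in the non-degenerate case.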

Generalized linear models constitute a rich and versatile collection of models to estimate
the function $f$ that allow $Y$ to have a variety of distributions. Such models assume that
the conditional distribution of the observation $Y_i$ belongs to a given exponential family with
expectation $\mathbb{E}[Y_i] = f(x_i)$, $i = 1, \dots, n$. The dependency in $x$ is modeled by a systematic
component $\eta : \mathcal{X} \to \mathbb{R}$ such that $g \circ f(x) = \eta(x)$ where $g : \bar{\mathcal{Y}} \to \mathbb{R}$ is a link function. The
choice of the link function is part of the modeling process but typically results from the following
considerations. From (2.2), we have

$$g \circ b'(\theta) = \eta \iff \theta = h(\eta),$$

where $h = (g \circ b')^{-1}$. In the rest of the paper, we only consider the so-called canonical link
function defined by $g = (b')^{-1}$ so that $h$ is the identity map and therefore the canonical parameter
$\theta$ itself is modeled by the systematic component $\eta$. Note that the technique employed in our
proofs does not apply to other link functions in a straightforward way.

The strongest assumption on which the model relies is on the systematic component $\eta$. In
generalized linear models, it is assumed that $\mathcal{X} \subset \mathbb{R}^d$ and that $\eta$ is a linear function of $x$,
i.e., $\eta(x) = \beta^\top x$, where $\beta \in \mathbb{R}^d$ is a parameter to be estimated. Other forms for $\eta$ have been
considered in related models such as generalized additive models (Hastie and Tibshirani, 1990)
and extended additive models (Friedman et al., 2000), which do not require that $\mathcal{X} \subset \mathbb{R}^d$. Let
$\mathcal{H} = \{f_1, \dots, f_M\}$ be a dictionary of functions $f_j : \mathcal{X} \to \mathbb{R}$ and for any $\lambda \in \mathbb{R}^M$, define the
linear combination

$$\mathsf{f}_\lambda = \sum_{j=1}^M \lambda_j f_j. \tag{2.3}$$

An extended additive model assumes that $\eta$ is one of such linear combinations, namely that
there exists $\lambda \in \mathbb{R}^M$ such that for any $x \in \mathcal{X}$:

$$\eta(x) = \mathsf{f}_\lambda(x) = \sum_{j=1}^M \lambda_j f_j(x).$$

Extended additive models can be embedded in the more general problem of aggregation, 
which does not assume that the data is generated from one particular model but tries to mimic 
the model that is the closest to the true distribution of the data. In the next section, we recall the 
problem of aggregation as originally defined by Nemirovski (2000) for Gaussian regression and 
extend it along the same lines as extended additive models to distributions in the exponential 
family. 



3 Kullback-Leibler aggregation 

3.1 The problem of aggregation 

Aggregation for the regression problem was introduced by Nemirovski (2000) and further devel- 
oped by Tsybakov (2003) where the author considers a regression problem with random design 
that has known distribution. We now recall the main ideas of aggregation applied to the regres- 
sion problem, with emphasis on its difference with the linear regression model and what the new 
challenges for this problem are. 

In the framework of the previous section, consider a finite dictionary $\mathcal{H} = \{f_1, \dots, f_M\}$ such
that $\|f_j\|$ is finite and for any $\lambda \in \mathbb{R}^M$, recall that $\mathsf{f}_\lambda$ denotes the linear combination of $f_j$'s
defined in (2.3). Consider the following regression model with additive noise: we observe $n$
independent random couples $(x_i, Y_i)$, $i = 1, \dots, n$, such that

$$Y_i = f(x_i) + \varepsilon_i, \tag{3.4}$$

where $\varepsilon_i$, $i = 1, \dots, n$, are some noise random variables. The goal of aggregation is to solve the
following optimization problem

$$\min_{\lambda \in \Lambda} \|\mathsf{f}_\lambda - f\|^2, \tag{3.5}$$

where $\Lambda$ is a given subset of $\mathbb{R}^M$ and $f$ is unknown. Previous papers on aggregation in the
regression problem have focused on three choices for the set $\Lambda$ corresponding to the three different
problems of aggregation originally introduced by Nemirovski (2000): $\Lambda$ is the whole space $\mathbb{R}^M$
(linear aggregation), $\Lambda$ is the flat simplex of $\mathbb{R}^M$ (convex aggregation) and $\Lambda$ is the finite
set formed by the $M$ vectors in the canonical basis of $\mathbb{R}^M$ (model selection aggregation) (see
Bunea et al., 2007; Nemirovski, 2000; Juditsky and Nemirovski, 2000; Tsybakov, 2003; Yang,
2004, and subsection 3.2 below). In practice, the regression function $f$ is unknown and it is
impossible to perfectly solve (3.5). Our goal is therefore to recover an approximate solution of
this problem in the following sense. We wish to construct an estimator $\hat\lambda_n$ such that

$$\|\mathsf{f}_{\hat\lambda_n} - f\|^2 - \min_{\lambda \in \Lambda} \|\mathsf{f}_\lambda - f\|^2 \tag{3.6}$$

is as small as possible. An inequality that provides an upper bound on the (random) quantity
in (3.6) in a certain probabilistic sense is called an oracle inequality.

Notice that this is not a linear model since we do not assume that the function $f$ is of the
form $\mathsf{f}_\lambda$ for some $\lambda \in \mathbb{R}^M$. Rather, the bias term $\min_{\lambda \in \Lambda} \|\mathsf{f}_\lambda - f\|^2$ may not vanish and the goal
is to mimic the linear combination with the smallest bias.

The notion of Kullback-Leibler aggregation defined in the next subsection broadens the scope 
of the above problem of aggregation to encompass other conditional distributions for Y given X. 



3.2 Kullback-Leibler aggregation 

Recall that the ubiquitous squared norm $\|\cdot\|^2$ as a measure of performance for regression
problems takes its roots in the Gaussian regression model. The Kullback-Leibler divergence
between two probability distributions $P$ and $Q$ is defined by

$$\mathcal{K}(P \| Q) = \begin{cases} \displaystyle\int \log\Big(\frac{dP}{dQ}\Big)\, dP & \text{if } P \ll Q, \\ \infty & \text{otherwise.} \end{cases}$$

Denote by $P_f$ the joint distribution of the observations $Y_i$, $i = 1, \dots, n$, under (3.4). If the noise
random variables $\varepsilon_i$ in (3.4) are i.i.d. $\mathcal{N}(0, \sigma^2)$ random variables, then

$$\mathcal{K}(P_f \| P_g) = \frac{n}{2\sigma^2} \|f - g\|^2.$$

In order to allow an easier comparison between the results of this paper and the literature,
consider a normalized Kullback-Leibler divergence defined by $\bar{\mathcal{K}}(P_f \| P_g) = \mathcal{K}(P_f \| P_g)/n$. In the
Gaussian regression setup, the quantity of interest in (3.6) can be written

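$$\mathbb{E}\,\bar{\mathcal{K}}(P_f \| P_{\mathsf{f}_{\hat\lambda_n}}) - \min_{\lambda \in \Lambda} \bar{\mathcal{K}}(P_f \| P_{\mathsf{f}_\lambda}), \tag{3.7}$$

up to a multiplicative constant term equal to $2\sigma^2$. Nevertheless, the quantity in (3.7) is
meaningful for other distributions in the exponential family.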
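The Gaussian identity $\mathcal{K}(P_f \| P_g) = \frac{n}{2\sigma^2}\|f - g\|^2$ used above can be verified numerically. The following sketch is an illustration under assumed values: the vectors `f_vals` and `g_vals` stand for hypothetical values $f(x_i)$ and $g(x_i)$, and the divergence of the product measure is obtained by summing one-dimensional divergences computed with a midpoint rule.

```python
import math

def gaussian_pdf(y, mean, sigma):
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def kl_1d(m1, m2, sigma, half_width=12.0, steps=20000):
    # Midpoint-rule approximation of int p log(p/q) for p = N(m1, sigma^2),
    # q = N(m2, sigma^2). The log ratio is written in closed form to avoid log(0).
    lo = m1 - half_width * sigma
    dy = 2.0 * half_width * sigma / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * dy
        log_ratio = ((y - m2) ** 2 - (y - m1) ** 2) / (2.0 * sigma ** 2)
        total += gaussian_pdf(y, m1, sigma) * log_ratio * dy
    return total

# Hypothetical values of f(x_i) and g(x_i) at n = 3 design points.
f_vals = [0.1, -0.4, 1.3]
g_vals = [0.0, 0.2, 1.0]
sigma = 0.5
n = len(f_vals)

# Independence across i: the KL divergence of the joint law is the sum over i.
kl_numeric = sum(kl_1d(fi, gi, sigma) for fi, gi in zip(f_vals, g_vals))

# Closed form: n / (2 sigma^2) * ||f - g||^2 with the empirical norm (1/n) sum.
sq_norm = sum((fi - gi) ** 2 for fi, gi in zip(f_vals, g_vals)) / n
kl_closed_form = n / (2.0 * sigma ** 2) * sq_norm

print(kl_numeric, kl_closed_form)
```

Dividing either quantity by $n$ gives the normalized divergence, which equals $\|f - g\|^2 / (2\sigma^2)$; this is the source of the multiplicative constant $2\sigma^2$.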

Consider $n$ independent observations $(x_i, Y_i)$, $i = 1, \dots, n$, from the regression model (3.4)
and assume that the distribution of $Y_i$ has density given by

$$p(y; \theta_i) = \exp\Big\{ \frac{y\theta_i - b(\theta_i)}{a} + c(y) \Big\}, \tag{3.8}$$

where $\theta_i = (b')^{-1} \circ f(x_i)$.

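Given a subset $\Lambda$ of $\mathbb{R}^M$, the goal of Kullback-Leibler aggregation (in short KL-aggregation)
is to construct an estimator $\hat\lambda_n$ such that the excess-KL defined by

$$\mathcal{E}_{KL}(\mathsf{f}_{\hat\lambda_n}, \Lambda, \mathcal{H}) = \bar{\mathcal{K}}(P_f \| P_{b' \circ \mathsf{f}_{\hat\lambda_n}}) - \inf_{\lambda \in \Lambda} \bar{\mathcal{K}}(P_f \| P_{b' \circ \mathsf{f}_\lambda}) \tag{3.9}$$

is as small as possible.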

Whereas KL-aggregation is a purely finite sample problem, it bears connections with the 
asymptotic theory of model misspecification as defined in White (1982), following LeCam (1953) 
and Akaike (1973). White (1982) proves that if the regression function $f$ is not of the form
$f = b' \circ \mathsf{f}_\lambda$ for some $\lambda$ in the set of parameters $\Lambda$, then under some identifiability and regularity
conditions, the maximum likelihood estimator converges to $\lambda^*$ defined by

$$\lambda^* = \arg\min_{\lambda \in \Lambda} \mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda}).$$

Since we plan to solve KL-aggregation with the maximum likelihood estimator, upper bounds 
on the excess-KL can be interpreted as finite sample versions of those original results. 

Note that assuming that $Y_i$ admits a density of the form (3.8) with known cumulant function
$b(\cdot)$ is a strong assumption unless $Y_i$ has Bernoulli distribution, in which case identification of
this distribution is trivial from the context of the statistical experiment. We emphasize here
that model misspecification pertains only to the systematic component. 

Remark 3.1 Bounds on the excess-KL can also be interpreted in terms of density estimation
with the Kullback-Leibler divergence as a measure of performance. Indeed, the function
$p(y; \mathsf{f}_\lambda(x))$ is a natural candidate to estimate the conditional density of $Y$ given $X = x$, where
$p(\cdot; \cdot)$ is given by (2.1).

We will consider the three choices for A that are now standard and correspond to the three 
standard problems of aggregation originally introduced by Nemirovski (2000). Following the 
work of Tsybakov (2003) we provide upper bounds in the form of oracle inequalities together 
with minimax lower bounds to assess the optimality of said upper bounds. As we will see in the 
following sections, optimal rates are essentially the same for the problem of KL-aggregation as 
for the Gaussian regression model studied in Tsybakov (2003). 

Model selection aggregation. The goal is to mimic the best $f_j$ in the dictionary $\mathcal{H}$. Therefore,
we can choose $\Lambda$ to be the finite set $\mathcal{V} = \{e_1, \dots, e_M\}$ formed by the $M$ vectors in the
canonical basis of $\mathbb{R}^M$. The optimal rate of model selection aggregation in the Gaussian
case is $(\log M)/n$.

Linear aggregation. The goal is to mimic the best linear combination of the $f_j$'s in the
dictionary $\mathcal{H}$. Therefore, we can choose $\Lambda$ to be the whole space $\mathbb{R}^M$. The optimal rate of linear
aggregation in the Gaussian case is $M/n$.

Convex aggregation. The goal is to mimic the best convex combination of the $f_j$'s in the
dictionary $\mathcal{H}$. Therefore, we can choose $\Lambda$ to be the rescaled flat simplex of $\mathbb{R}^M$ denoted
by $\Lambda_1^+(R)$ and defined by

$$\Lambda_1^+(R) = \Big\{ \lambda \in \mathbb{R}^M : \lambda_j \ge 0,\ j = 1, \dots, M,\ \sum_{j=1}^M \lambda_j = R \Big\},$$

where $R > 0$. In the sequel, upper bounds are stated for any subset of the $\ell_1$ ball of $\mathbb{R}^M$
with radius $R > 0$ denoted by $\Lambda_1(R)$ and defined by

$$\Lambda_1(R) = \Big\{ \lambda \in \mathbb{R}^M : \sum_{j=1}^M |\lambda_j| \le R \Big\}. \tag{3.10}$$

While this set is more massive than $\Lambda_1^+(R)$, it results in bounds that are deteriorated by
only a factor 2. The optimal rate of convex aggregation in the Gaussian case is
$(M/n) \wedge \sqrt{\log(1 + M/\sqrt{n})/n}$.

Note that we use hereafter a looser definition of linear aggregation where $\Lambda$ is not restricted to
be the whole space $\mathbb{R}^M$ but can also be any closed convex subset of $\mathbb{R}^M$ such as an $\ell_\infty$ ball of
$\mathbb{R}^M$. In this sense, convex aggregation can be viewed as a special case of linear aggregation.
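The three constraint sets can be made concrete with simple membership tests. The sketch below is illustrative only: the function names and the numerical tolerance are assumptions, and it simply encodes the canonical basis, the rescaled flat simplex and the $\ell_1$ ball of radius $R$ as described above.

```python
def in_canonical_basis(lam, tol=1e-12):
    # Model selection: exactly one coordinate equal to 1, all others equal to 0.
    ones = sum(1 for v in lam if abs(v - 1.0) <= tol)
    zeros = sum(1 for v in lam if abs(v) <= tol)
    return ones == 1 and zeros == len(lam) - 1

def in_flat_simplex(lam, radius, tol=1e-12):
    # Convex aggregation: nonnegative coordinates summing to the radius R.
    return all(v >= -tol for v in lam) and abs(sum(lam) - radius) <= tol

def in_l1_ball(lam, radius, tol=1e-12):
    # Set (3.10): sum of absolute values at most R.
    return sum(abs(v) for v in lam) <= radius + tol

e2 = [0.0, 1.0, 0.0]            # a canonical basis vector
mixture = [0.5, 0.25, 0.25]     # a convex combination, not a basis vector
signed = [0.5, -0.5, 0.0]       # in the l1 ball of radius 1, not in the simplex

print(in_canonical_basis(e2), in_flat_simplex(e2, 1.0), in_l1_ball(e2, 1.0))
print(in_canonical_basis(mixture), in_flat_simplex(mixture, 1.0))
print(in_flat_simplex(signed, 1.0), in_l1_ball(signed, 1.0))
```

The nesting visible here (basis vectors inside the simplex, the simplex inside the $\ell_1$ ball) is what allows bounds stated for subsets of the $\ell_1$ ball to cover convex aggregation.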



4 Main results 

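Let $Z = \{(x_1, Y_1), \dots, (x_n, Y_n)\}$ be $n$ independent observations such that for each $i$, the density
of $Y_i$ is of the form $p(y_i; \theta_i)$ as defined in (2.1) where $\theta_i = (b')^{-1} \circ f(x_i)$. Then, we can write for
any $\lambda \in \mathbb{R}^M$,

$$\mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda}) = -\frac{n}{a} \big( \langle f, \mathsf{f}_\lambda \rangle - \langle b \circ \mathsf{f}_\lambda, \mathbf{1} \rangle \big) - \sum_{i=1}^n \mathbb{E}[c(Y_i)] + \mathrm{Ent}(P_f), \tag{4.1}$$

where $\mathrm{Ent}(P_f)$ denotes the entropy of $P_f$ and is defined by

$$\mathrm{Ent}(P_f) = \sum_{i=1}^n \mathbb{E}\big[ \log\big( p(Y_i; (b')^{-1} \circ f(x_i)) \big) \big].$$

Note that the term $-\sum_{i=1}^n \mathbb{E}[c(Y_i)] + \mathrm{Ent}(P_f)$ does not depend on $\lambda$.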
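The decomposition of the Kullback-Leibler divergence into a term that depends on $\lambda$ and a term that does not can be checked on the Bernoulli family, where $a = 1$, $c(y) = 0$ and $b(\theta) = \log(1 + e^\theta)$. The sketch below is a hedged illustration: the values in `f_vals` (playing the role of $f(x_i)$) and `eta` (playing the role of $\mathsf{f}_\lambda(x_i)$) are hypothetical, and the divergence is taken unnormalized, that is, summed over the $n$ observations.

```python
import math

def b(theta):
    # Bernoulli cumulant function.
    return math.log1p(math.exp(theta))

def sigmoid(theta):
    # b'(theta) for the Bernoulli family.
    return 1.0 / (1.0 + math.exp(-theta))

f_vals = [0.2, 0.7, 0.5, 0.9]   # hypothetical f(x_i) = E[Y_i]
eta = [-1.0, 0.3, 0.0, 1.5]     # hypothetical f_lambda(x_i)
n = len(f_vals)
a = 1.0

# Direct computation: sum over i of KL(Bernoulli(f_i) || Bernoulli(b'(eta_i))).
kl_direct = 0.0
for fi, ei in zip(f_vals, eta):
    q = sigmoid(ei)
    kl_direct += fi * math.log(fi / q) + (1 - fi) * math.log((1 - fi) / (1 - q))

# Same quantity through the decomposition with the empirical inner product.
inner_f_flam = sum(fi * ei for fi, ei in zip(f_vals, eta)) / n
avg_b_flam = sum(b(ei) for ei in eta) / n
sum_expected_c = 0.0            # c(y) = 0 for Bernoulli
entropy_term = sum(fi * math.log(fi) + (1 - fi) * math.log(1 - fi) for fi in f_vals)

kl_decomposed = -(n / a) * (inner_f_flam - avg_b_flam) - sum_expected_c + entropy_term

print(kl_direct, kl_decomposed)
```

Only `inner_f_flam` and `avg_b_flam` change with $\lambda$, which is why minimizing the divergence over $\lambda$ amounts to maximizing $\langle f, \mathsf{f}_\lambda \rangle - \langle b \circ \mathsf{f}_\lambda, \mathbf{1} \rangle$.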

Recall that the log-likelihood of an estimator $\theta = (\theta_1, \dots, \theta_n) \in \mathbb{R}^n$ based on these
observations is defined by

$$\sum_{i=1}^n \log[p(Y_i; \theta_i)] = \sum_{i=1}^n \Big\{ \frac{Y_i \theta_i - b(\theta_i)}{a} + c(Y_i) \Big\}.$$

Therefore, for estimators of the form $\theta_i = \mathsf{f}_\lambda(x_i)$, we are interested in maximizing the function

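$$\ell_n(\lambda) = \sum_{i=1}^n \big\{ Y_i \mathsf{f}_\lambda(x_i) - b(\mathsf{f}_\lambda(x_i)) \big\} \tag{4.2}$$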
over a certain set $\Lambda$ that depends on the problem at hand.

We now give a series of bounds for the problem of KL-aggregation. All proofs are gathered in 
Section 7 and rely on the following conditions, which can be easily checked given the cumulant 
function b. 

Condition 1 The set of admissible parameters is $\Theta = \mathbb{R}$ and there exists a positive constant
$B^2$ such that

$$\sup_{\theta \in \Theta} b''(\theta) \le B^2.$$

Condition 2 We say that the couple $(\mathcal{H}, \Lambda)$ satisfies condition 2 if there exists a positive
constant $\kappa^2$ such that

$$b''(\mathsf{f}_\lambda(x)) \ge \kappa^2,$$

uniformly for all $x \in \mathcal{X}$ and all $\lambda \in \Lambda$.

Conditions 1 and 2 are discussed in the light of several examples in section 6. Condition 1 is used
only to ensure that the distributions of $Y_i$ have uniformly bounded variances and sub-Gaussian
tails whereas condition 2 is a strong convexity condition that depends not only on the cumulant
function $b$ but also on the aggregation problem at hand that is characterized by the couple
$(\mathcal{H}, \Lambda)$.
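As an illustration of how the two conditions can be checked from the cumulant function, consider the Bernoulli case, where $b''(\theta) = \sigma(\theta)(1 - \sigma(\theta))$ with $\sigma$ the logistic function. The sketch below rests on assumptions that are not part of the statements above: it supposes that the dictionary is bounded by 1 in sup norm and that $\Lambda$ lies in the $\ell_1$ ball of radius `R`, so that $|\mathsf{f}_\lambda(x)| \le R$ and one may take $\kappa^2 = b''(R)$. The grids are arbitrary.

```python
import math

def b_second(theta):
    # Second derivative of the Bernoulli cumulant function log(1 + e^theta).
    s = 1.0 / (1.0 + math.exp(-theta))
    return s * (1.0 - s)

# Condition 1: b'' is maximal at theta = 0, where it equals 1/4.
B2 = 0.25
wide_grid = [-20.0 + 40.0 * k / 4000 for k in range(4001)]
largest_b_second = max(b_second(t) for t in wide_grid)

# Condition 2 under the assumption |f_lambda(x)| <= R (hypothetical setting):
# b'' is even and decreasing in |theta|, so its minimum on [-R, R] is b''(R).
R = 2.0
kappa2 = b_second(R)
range_grid = [-R + 2.0 * R * k / 1000 for k in range(1001)]
smallest_b_second = min(b_second(t) for t in range_grid)

print(largest_b_second, B2)
print(smallest_b_second, kappa2)
```

Without a bound on $\mathsf{f}_\lambda$, $b''$ tends to 0 at infinity in this example, which shows why condition 2 depends on the couple $(\mathcal{H}, \Lambda)$ and not on $b$ alone.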



4.1 Model selection aggregation 

Recall that the goal of model selection aggregation is to mimic a function $f_j$ such that
$\mathcal{K}(P_f \| P_{b' \circ f_j}) \le \mathcal{K}(P_f \| P_{b' \circ f_k})$, $k \ne j$. A natural candidate would be the function in the dictionary that maximizes
the function $\ell_n$ defined in (4.2) either over the finite set $\mathcal{V} = \{e_1, \dots, e_M\}$ formed by the $M$
vectors in the canonical basis of $\mathbb{R}^M$ or over the convex hull of $\mathcal{V}$ which is given by $\Lambda_1^+(1)$.
However, it has been established (see, e.g., Juditsky et al., 2008; Lecue, 2007b; Lecue and Mendelson,
2009) that such a choice yields suboptimal rates of convergence in general. As a consequence
we resort to a more sophisticated example obtained by penalized log-likelihood maximization.

Let $\beta > 0$ be a tuning parameter to be chosen large enough and define $\hat\lambda \in \Lambda_1^+(1)$ to be the
unique vector that solves the following convex optimization problem:

$$\hat\lambda = \arg\max_{\lambda \in \Lambda_1^+(1)} \Big\{ \sum_{j=1}^M \lambda_j \ell_n(e_j) + \ell_n(\lambda) + \beta H(\lambda) \Big\}, \tag{4.3}$$

where

$$H(\lambda) = -\sum_{j=1}^M \lambda_j \log \lambda_j$$

denotes the entropy of the vector $\lambda \in \Lambda_1^+(1)$ when regarded as a probability distribution on the
finite set $\mathcal{V}$. We propose to solve model selection aggregation using the convex combination $\mathsf{f}_{\hat\lambda}$.
Before giving bounds on the excess-KL of $\mathsf{f}_{\hat\lambda}$, we comment on the main difference between $\hat\lambda$
and the following commonly used vector of weights $\tilde\lambda \in \Lambda_1^+(1)$ defined by:

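
$$\tilde\lambda_j = \frac{\exp\big(\ell_n(e_j)/\beta\big)}{\sum_{k=1}^M \exp\big(\ell_n(e_k)/\beta\big)}, \qquad j = 1, \dots, M. \tag{4.4}$$

It can be shown (see, e.g., Dupuis and Ellis, 1997, Proposition 1.4.2) that $\tilde\lambda$ is the unique solution
of the following optimization problem

$$\max_{\lambda \in \Lambda_1^+(1)} \Big\{ \sum_{j=1}^M \lambda_j \ell_n(e_j) + \beta H(\lambda) \Big\}.$$
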
The use of exponential weights to perform optimal model selection aggregation goes back to 
Catoni (2004) and Yang (2000) with the progressive mixture rule. Like its generalization pro- 
posed in Juditsky et al. (2008), it requires an additional and somewhat counterintuitive averag- 
ing step but the optimal rates of model selection that these estimators yield contributed to the 
general belief that this extra step was necessary. Recently, Audibert (2007) argued that such 
estimators yield suboptimal rates with high probability even though they behave optimally in 
expectation. 

The estimator $\hat\lambda$ that we use here does not take the form of exponential weights and although
both criteria are penalized by the entropy, they are quite different. In particular, in the proof of
the following theorems, a key ingredient is the fact that the function $\mathsf{f}_\lambda \mapsto \sum_{j=1}^M \lambda_j \ell_n(e_j) + \ell_n(\lambda)$
is strongly concave.

Theorem 4.1 Assume that condition 1 holds and that $(\mathcal{H}, \Lambda_1^+(1))$ satisfies condition 2. Recall
that $\mathcal{V} = \{e_1, \dots, e_M\}$ is the finite set formed by the $M$ vectors in the canonical basis of $\mathbb{R}^M$.
Then, the aggregate $\mathsf{f}_{\hat\lambda}$ with $\hat\lambda$ defined in (4.3) and $\beta > 8B^2a/\kappa^2$ satisfies

$$\mathbb{E}\big[\mathcal{E}_{KL}(\mathsf{f}_{\hat\lambda}, \mathcal{V}, \mathcal{H})\big] \le \frac{\beta}{a} \frac{\log M}{n}. \tag{4.5}$$

A similar result for $\mathsf{f}_{\tilde\lambda}$ where $\tilde\lambda$ is given in (4.4) was obtained by Dalalyan and Tsybakov (2007)
for a different class of regression problem with deterministic design under the squared loss.
For random design, Juditsky et al. (2008) obtained essentially the same results for the mirror
averaging algorithm. Also for random design, Lecue and Mendelson (2009) proposed a different
estimator to solve this problem and gave for the first time a bound with high probability with
the optimal remainder term. Such a result was claimed by Audibert (2007) for a different
estimator but comes without proof. Despite this recent effervescence, no bounds that hold
with high probability have been derived for the deterministic design case considered here and
the estimator proposed by Lecue and Mendelson (2009) is based on a sample splitting argument
that does not extend to deterministic design. The next theorem aims at giving such an inequality
for the aggregate $\mathsf{f}_{\hat\lambda}$.

Theorem 4.2 Assume that condition 1 holds and that $(\mathcal{H}, \Lambda_1^+(1))$ satisfies condition 2. Recall
that $\mathcal{V} = \{e_1, \dots, e_M\}$ is the finite set formed by the $M$ vectors in the canonical basis of $\mathbb{R}^M$.
Then, for any $\delta > 0$, with probability $1 - \delta$, the aggregate $\mathsf{f}_{\hat\lambda}$ with $\hat\lambda$ defined in (4.3) and
$\beta > 8B^2a/\kappa^2$ satisfies

£KUfx„,v,n)<- — - — , (4.6) 

The proofs of both theorems are gathered in subsection 7.2. 



4.2 Linear aggregation 

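Let $\Lambda \subset \mathbb{R}^M$ be a closed convex set or $\mathbb{R}^M$ itself. The maximum likelihood aggregate over
$\Lambda \subset \mathbb{R}^M$ is uniquely defined as a function in the quotient space $Q_{1:n}$ by the linear combination
$\mathsf{f}_{\hat\lambda_n}$ with coefficients given by

$$\hat\lambda_n \in \arg\max_{\lambda \in \Lambda} \ell_n(\lambda). \tag{4.7}$$
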

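Note that both $\hat\lambda_n$ and $\lambda^* \in \arg\min_{\lambda \in \Lambda} \mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda})$ exist as soon as $\Lambda$ is a closed convex set (see
Ekeland and Temam, 1999, Chapter II, Proposition 1.2). Likewise, from the same proposition,
we find that if $\Lambda = \mathbb{R}^M$, condition 2 entails that both $\hat\lambda_n$ and $\lambda^*$ exist. Indeed, under condition 2,
the function $b$ is convex coercive and thus both functionals

$$\mathsf{f}_\lambda \mapsto -\Big\{ \frac{1}{n} \sum_{i=1}^n Y_i \mathsf{f}_\lambda(x_i) - \langle b \circ \mathsf{f}_\lambda, \mathbf{1} \rangle \Big\}$$

and

$$\mathsf{f}_\lambda \mapsto -\langle f, \mathsf{f}_\lambda \rangle + \langle b \circ \mathsf{f}_\lambda, \mathbf{1} \rangle$$

are convex and coercive. Thus, the aggregates $\mathsf{f}_{\lambda^*}$ and $\mathsf{f}_{\hat\lambda_n}$ are uniquely defined as functions in
the quotient space $Q_{1:n}$, even though $\lambda^*$ and $\hat\lambda_n$ may not be unique.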
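To make the maximum likelihood aggregate concrete, the following sketch computes it for Bernoulli responses with $\Lambda = \mathbb{R}^M$. Everything specific in it is an assumption made for illustration: the dictionary $\{1, x, \cos(2\pi x)\}$, the regression function $f(x) = 0.3 + 0.4x$ (chosen so that the model on the systematic component is misspecified), the criterion $\sum_i \{Y_i \mathsf{f}_\lambda(x_i) - b(\mathsf{f}_\lambda(x_i))\}$, and plain gradient ascent as the optimizer, which is a generic choice and not a method taken from the text.

```python
import math
import random

def sigmoid(t):
    # b'(t) for the Bernoulli cumulant function b(t) = log(1 + e^t).
    return 1.0 / (1.0 + math.exp(-t))

def b(t):
    return math.log1p(math.exp(t))

def log_lik(lam, F, Y):
    # Criterion: sum_i { Y_i f_lambda(x_i) - b(f_lambda(x_i)) }.
    total = 0.0
    for row, y in zip(F, Y):
        eta = sum(l * v for l, v in zip(lam, row))
        total += y * eta - b(eta)
    return total

def gradient(lam, F, Y):
    # Partial derivative in lambda_j: sum_i f_j(x_i) (Y_i - b'(f_lambda(x_i))).
    grad = [0.0] * len(lam)
    for row, y in zip(F, Y):
        eta = sum(l * v for l, v in zip(lam, row))
        resid = y - sigmoid(eta)
        for j, v in enumerate(row):
            grad[j] += v * resid
    return grad

# Deterministic design on [0, 1] and a hypothetical dictionary of M = 3 functions.
n = 200
xs = [(i + 0.5) / n for i in range(n)]
F = [[1.0, x, math.cos(2.0 * math.pi * x)] for x in xs]

# Responses with E[Y_i] = 0.3 + 0.4 x_i, which is not of the form b'(f_lambda).
rng = random.Random(0)
Y = [1.0 if rng.random() < 0.3 + 0.4 * x else 0.0 for x in xs]

# Since b'' <= 1/4, the gradient is Lipschitz with constant at most ||F||_F^2 / 4,
# so a step of 4 / ||F||_F^2 gives monotone ascent for this concave criterion.
step = 4.0 / sum(v * v for row in F for v in row)

lam = [0.0, 0.0, 0.0]
start_value = log_lik(lam, F, Y)
for _ in range(3000):
    g = gradient(lam, F, Y)
    lam = [l + step * gj for l, gj in zip(lam, g)]

final_value = log_lik(lam, F, Y)
grad_norm = math.sqrt(sum(gj * gj for gj in gradient(lam, F, Y)))
print(lam, final_value, grad_norm)
```

A vanishing gradient characterizes the unconstrained maximizer because the criterion is concave; for a constrained set $\Lambda$ one would need a projection or another constrained solver, which this sketch does not attempt.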

If the observations $Z$ were actually drawn from an exponential family with canonical
parameter $\theta^*$, we could apply the asymptotic theory of maximum likelihood estimation to obtain
consistency results. The goal here is not only to derive bounds on the quantity in (3.7) without
assuming that the model holds but also to have precise finite sample bounds that explicitly
depend on the sample size $n$ and the size $M$ of the dictionary.

We first extend the original results of Nemirovski (2000) and Tsybakov (2003) by providing
bounds on the expected excess-KL, $\mathbb{E}[\mathcal{E}_{KL}(\mathsf{f}_{\hat\lambda_n}, \Lambda, \mathcal{H})]$, where $\Lambda$ is either a closed convex set or
$\Lambda = \mathbb{R}^M$, which corresponds to the problem of linear aggregation.

Theorem 4.3 Let $\Lambda$ be a closed convex subset of $\mathbb{R}^M$ or $\mathbb{R}^M$ itself, such that $(\mathcal{H}, \Lambda)$ satisfies
condition 2. If the marginal variances satisfy $\mathbb{E}[Y_i - f(x_i)]^2 \le \sigma^2$ for any $i = 1, \dots, n$, then the
maximum likelihood aggregate $\mathsf{f}_{\hat\lambda_n}$ over $\Lambda$ satisfies
where $D \le M$ is the dimension of the linear span of $\mathcal{H}$ and

$$\lambda^* \in \arg\min_{\lambda \in \Lambda} \mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda}).$$



Vectors $\lambda^* \in \arg\min_{\lambda \in \Lambda} \mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda})$ are oracles since they cannot be computed without the
knowledge of $P_f$. The oracle distribution $P_{b' \circ \mathsf{f}_{\lambda^*}}$ corresponds to the distribution of the form
$P_{b' \circ \mathsf{f}_\lambda}$, $\lambda \in \Lambda$, that is the closest to the true distribution $P_f$ in terms of Kullback-Leibler
divergence. Introducing this oracle allows us to assess the performance of the maximum likelihood
aggregate, without assuming that $P_f$ is of the form $P_{b' \circ \mathsf{f}_\lambda}$ for some $\lambda \in \Lambda$. Notice also that
from (2.2), the bounded variance condition $\mathbb{E}[Y_i - f(x_i)]^2 \le \sigma^2$ is a direct consequence of
condition 1 with $\sigma^2 = aB^2$.

Theorem 4.3 is valid in expectation. In other words, it characterizes the rates of
KL-aggregation attained by the maximum likelihood aggregate on average with respect to the
realizations of the sample $Z$. The following theorem shows that these bounds are not only valid in
expectation but also with high probability.

Theorem 4.4 Let $\Lambda$ be a closed convex subset of $\mathbb{R}^M$ or $\mathbb{R}^M$ itself and such that $(\mathcal{H}, \Lambda)$ satisfies
condition 2. Moreover let condition 1 hold and let $D$ be the dimension of the linear span of the
dictionary $\mathcal{H} = \{f_1, \dots, f_M\}$. Then, for any $\delta > 0$, with probability $1 - \delta$, the maximum likelihood
aggregate $\mathsf{f}_{\hat\lambda_n}$ over $\Lambda$ satisfies



where $\lambda^* \in \arg\min_{\lambda \in \Lambda} \mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda})$.

We see that the price to pay to obtain bounds with high probability is essentially the same as
for the bounds in expectation up to an extra multiplicative term of order $\log(1/\delta)$.

4.3 Convex aggregation 

In this subsection, we fix $R > 0$ and assume that $\Lambda \subset \Lambda_1(R)$ is a closed convex set where $\Lambda_1(R)$
is the $\ell_1$ ball of $\mathbb{R}^M$ with radius $R$ defined in (3.10). Note that both a maximum likelihood
estimator $\hat\lambda_n$ and an oracle $\lambda^* \in \arg\min_{\lambda \in \Lambda} \mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda})$ exist as soon as $\Lambda$ is a closed convex
set (see Ekeland and Temam, 1999, Chapter II, Proposition 1.2).

Recall that if $(\mathcal{H}, \Lambda)$ satisfies condition 2, Theorems 4.3 and 4.4 also hold. The following
theorems ensure a better rate for the maximum likelihood aggregate $\mathsf{f}_{\hat\lambda_n}$ over $\Lambda_1(R)$ when $D$,
and thus $M$, becomes much larger than $n$. It extends the problem of convex aggregation defined
by Nemirovski (2000), Juditsky and Nemirovski (2000) and Tsybakov (2003) to the case where the
conditional distribution of the response variables is not restricted to be Gaussian.

Theorem 4.5 Fix $R > 0$ and let $\Lambda$ be any closed convex subset of the $\ell_1$ ball $\Lambda_1(R)$ defined
in (3.10). Let condition 1 hold and assume that the dictionary $\mathcal{H}$ consists of functions satisfying
$\|f_j\| \le 1$, for any $j = 1, \dots, M$. Then, the maximum likelihood aggregate $\mathsf{f}_{\hat\lambda_n}$ over $\Lambda$ satisfies

JE[£^Ufx^,A,n)] < AeRB^^f^. (4.10) 

Moreover, if $(\mathcal{H}, \Lambda)$ satisfies condition 2, then

, „2 8eRB , — /logM 

where $\lambda^* \in \arg\min_{\lambda \in \Lambda} \mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda})$.

The bounds of Theorem 4.5 also have a counterpart with high probability as shown in the next
theorem.

Theorem 4.6 Fix $R > 0$ and let $\Lambda$ be any closed convex subset of the $\ell_1$ ball $\Lambda_1(R)$ defined
in (3.10). Fix $M > 3$, let condition 1 hold and assume that the dictionary $\mathcal{H}$ consists of
functions satisfying $\|f_j\| \le 1$, for any $j = 1, \dots, M$. Then, for any $\delta > 0$, with probability $1 - \delta$,
the maximum likelihood aggregate $\mathsf{f}_{\hat\lambda_n}$ over $\Lambda$ satisfies

SKLif, , A,^) < 8RB.[^\[^^./hi(2/6)- (4.11) 
^" \ a \ n 

Moreover, if $(\mathcal{H}, \Lambda)$ satisfies condition 2, then on the same event of probability $1 - \delta$, it holds

, „o IQRB , /logM ; -— 

Wh^ - hA? < ^^^wy^^/bi(V^, (4.12) 

where A* G argmin^^gy^ /C(Pj||Pfe'of;^)- 

This explicit logarithmic dependence in the dimension $M$ illustrates the benefit of the $\ell_1$
constraint for high dimensional problems. Raskutti et al. (2009) have obtained essentially the
same result as Theorem 4.6 for the special case of Gaussian linear regression. While their proof
technique yields significantly larger constants, they also cover the case of aggregation over $\ell_q$
balls for $q < 1$ explicitly. However, their result is limited to the linear regression model where
the regression function $f$ is of the form $f = \mathsf{f}_{\lambda^*}$ for some $\lambda^* \in \Lambda_1(R)$.
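The logarithmic dependence on $M$ can be illustrated numerically. The following Python sketch assumes that the right-hand sides of (4.10) and (4.11) are read as $4eRB\sqrt{\pi \log M/(an)}$ and $8RB\sqrt{\pi e \log M/(an)}\sqrt{\log(2/\delta)}$; the function names are hypothetical and only serve the illustration.

```python
import math

def convex_bound_expectation(R, B, a, n, M):
    """Assumed right-hand side of (4.10): 4 e R B sqrt(pi log M / (a n))."""
    return 4 * math.e * R * B * math.sqrt(math.pi * math.log(M) / (a * n))

def convex_bound_high_prob(R, B, a, n, M, delta):
    """Assumed right-hand side of (4.11):
    8 R B sqrt(pi e log M / (a n)) sqrt(log(2 / delta))."""
    rate = math.sqrt(math.pi * math.e * math.log(M) / (a * n))
    return 8 * R * B * rate * math.sqrt(math.log(2 / delta))

if __name__ == "__main__":
    # Bernoulli constants (a = 1, B = 1/2), R = 1, n = 1000 observations.
    for M in (10**3, 10**6, 10**12):
        print(M, convex_bound_expectation(1.0, 0.5, 1.0, 1000, M))
```

Squaring the size of the dictionary multiplies either bound by $\sqrt{2}$ only, which is the sense in which the $\ell_1$ constraint is well suited to large dictionaries.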

Most of the bounds for convex aggregation that have appeared in the literature hold for
the expected excess-KL. While many papers provide bounds with high probability (see, e.g.,
Koltchinskii, 2008; Mitchell and van de Geer, 2009, and references therein), they typically do
not hold for the excess-KL itself but for a quantity related to

$$\mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_{\hat\lambda}}) - C \min_{\lambda \in \Lambda} \mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda}),$$

where $C > 1$ is a constant. When the quantity $\min_{\lambda \in \Lambda} \mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda})$ is not small enough, such
bounds can become inaccurate. A notable exception is Nemirovski et al. (2008, Proposition 2.2),
where the authors derive a result similar to Theorem 4.6 under a different but related set of
assumptions. Most importantly, their bounds do not hold for the maximum likelihood estimator
but for the output of a recursive stochastic optimization algorithm.




4.4 Discussion 



As mentioned before, it is worth noticing that the technique employed in proving the bounds
in expectation of the previous subsection yields bounds with high probability at almost no extra
cost. More precisely, our proofs do not employ the usual techniques to bound the suprema of
empirical processes.

While the original motivation for aggregation, as put forward by Nemirovski (2000) is to 
aggregate estimators constructed from a hold-out sample, mainly to obtain adaptive estimators 
(see Yang, 2004; Lecue, 2007a; Rigollet and Tsybakov, 2007), it is now standard to present results 
in the pure aggregation framework where the goal is to aggregate deterministic functions as in 
Tsybakov (2003) and the section above. Maximum likelihood aggregation can yield adaptive 
estimators in nonparametric estimation by aggregating projection estimators constructed from a 
preliminary sample. In addition, the new results of Theorem 4.6 potentially yield much stronger 
results than usual adaptation results that are in expectation. Also, such results can be applied 
not only to regression but also to binary classification as detailed in Section 6. 

We finally mention the question of persistence posed by Greenshtein and Ritov (2004) and
further studied by Greenshtein (2006) and Bartlett et al. (2009). In these papers, the goal is
to find performance bounds that explicitly depend on $n$, $M$ and the radius $R$ of the $\ell_1$ ball
$\Lambda_1(R)$. More precisely, allowing $M$ and $R$ to depend on $n$, persistence asks the question of
which regime gives remainder terms that converge to 0. While we do not pursue this question
directly, we obtain such bounds for deterministic design and show that the constrained maximum
likelihood estimator on a closed convex subset of the $\ell_1$ ball is persistent as long as
$R = R(n) = o\big(\sqrt{n/\log M}\big)$. The original result of Greenshtein and Ritov (2004) in this sense
allows only $R = o\big([n/\log M]^{1/4}\big)$, but when the design is random with unknown distribution.
The use of deterministic design in the present paper makes the prediction task much easier.
Indeed, a significant amount of work to prove persistence has been made toward describing general
conditions on the distribution of the design to ensure persistence at a rate $R = o\big(\sqrt{n/\log M}\big)$,
as in Greenshtein (2006) and Bartlett et al. (2009).

5 Optimal rates of aggregation 

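In Section 4, we have derived upper bounds for the excess-risk both in expectation and with
high probability under appropriate conditions. The bounds in expectation can be summarized
as follows. For each $\Lambda \in \{\mathcal{V}, \Lambda_1(1), \mathbb{R}^M\}$, there exists an estimator $T_n$ such that

$$\mathbb{E}\big[\mathcal{E}_{\mathrm{KL}}(T_n, \Lambda, \mathcal{H})\big] \le C\,\Delta_{n,M}(\Lambda),$$

where $C > 0$ and

$$\Delta_{n,M}(\Lambda) = \begin{cases}
\dfrac{D}{n} \wedge \dfrac{\log M}{n} & \text{if } \Lambda = \mathcal{V} \text{ (model selection aggregation)},\\[2mm]
\dfrac{D}{n} & \text{if } \Lambda \subset \mathbb{R}^M \text{ (linear aggregation)},\\[2mm]
\dfrac{D}{n} \wedge \sqrt{\dfrac{\log M}{n}} & \text{if } \Lambda = \Lambda_1(1) \text{ (convex aggregation)},
\end{cases} \tag{5.13}$$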

where $D \le M \wedge n$ is the dimension of the linear span of the dictionary $\mathcal{H}$ and $\Lambda \subset \mathbb{R}^M$ means that
$\Lambda$ is either a closed convex subset of $\mathbb{R}^M$ or $\mathbb{R}^M$ itself. Note that for model selection aggregation,
the estimator that achieves this rate is given by

$$T_n = \mathsf{f}_{\hat\lambda_n}\,\mathbb{1}(D > \log M) + \mathsf{f}^{\mathrm{ML}}\,\mathbb{1}(D \le \log M),$$

where $\hat\lambda_n$ is defined in (4.4), $\mathsf{f}^{\mathrm{ML}}$ is the maximum likelihood aggregate over $\Lambda_1(1)$ and $\mathbb{1}(\cdot)$ denotes
the indicator function. In the rest of the paper, we call $D$ the rank of $\mathcal{H}$. Clearly, the lower
bound for linear aggregation does not hold for any closed convex subset of $\mathbb{R}^M$ since $\{0\}$ is such
a set and clearly $\Delta_{n,M}(\{0\}) = 0$. We will prove the lower bound on the $\ell_\infty$ box defined by

$$\Lambda_\infty(1) = \Big\{\lambda \in \mathbb{R}^M : \max_{1 \le j \le M} |\lambda_j| \le 1\Big\}.$$

For linear and model selection aggregation, these rates are known to be optimal in the
Gaussian case where the design is random but with known distribution (Tsybakov, 2003) and
where the design is deterministic (Rigollet and Tsybakov, 2010). For convex aggregation, it has
been established by Tsybakov (2003) (see also Rigollet and Tsybakov, 2010) that the optimal
rate of convergence for Gaussian regression is of order $\sqrt{\log(1 + eM/\sqrt{n})/n}$, which is equivalent
to the upper bounds obtained in Theorems 4.5-4.6 of the present paper when $M \gg \sqrt{n}$ but is
smaller in general. To obtain better rates, one may resort to more complicated, combinatorial
procedures such as the ones derived in the papers cited above but the full description of this
idea goes beyond the scope of this paper.
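To make the comparison between the two convex aggregation rates concrete, here is a minimal Python sketch. It only evaluates the two orders of magnitude quoted above, $\sqrt{\log M/n}$ and $\sqrt{\log(1 + eM/\sqrt{n})/n}$, with all constants dropped; the function names are hypothetical.

```python
import math

def upper_rate(n, M):
    """Order of the upper bound of Theorems 4.5-4.6: sqrt(log M / n)."""
    return math.sqrt(math.log(M) / n)

def optimal_rate(n, M):
    """Order of the optimal rate: sqrt(log(1 + e M / sqrt(n)) / n)."""
    return math.sqrt(math.log(1 + math.e * M / math.sqrt(n)) / n)

if __name__ == "__main__":
    n = 100
    for M in (10, 10**3, 10**6):
        ratio = upper_rate(n, M) / optimal_rate(n, M)
        print(f"M={M:>8d}  upper/optimal = {ratio:.3f}")
```

With $n = 100$, the ratio is noticeably larger than one when $M$ is comparable to $\sqrt{n}$ and approaches one as $M$ grows, in line with the statement that the two rates are equivalent when $M \gg \sqrt{n}$.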

In this section, we prove that these rates are minimax optimal under weaker conditions
that are also satisfied by the Bernoulli distribution. The notion of optimality for aggregation
employed here is the one introduced by Tsybakov (2003). Before stating the main result of this
section, we need to introduce the following definition. Fix $\kappa^2 > 0$ and let $\Gamma(\kappa^2)$ be the level set
of the function $b''$ defined by

$$\Gamma(\kappa^2) = \{\theta \in \mathbb{R} : b''(\theta) \ge \kappa^2\}. \tag{5.14}$$

In the Gaussian case, it is clear from Table 1 that $\Gamma(\kappa^2) = \mathbb{R}$ for any $\kappa^2 \le 1$. For the cumulant
function of the Bernoulli distribution, when $\kappa^2 < 1/4$, $\Gamma(\kappa^2)$ is a compact symmetric interval
given by

$$\Gamma(\kappa^2) = \Big[-\log\Big(\frac{1 - 2\kappa^2 + \sqrt{1 - 4\kappa^2}}{2\kappa^2}\Big),\ \log\Big(\frac{1 - 2\kappa^2 + \sqrt{1 - 4\kappa^2}}{2\kappa^2}\Big)\Big].$$

Furthermore, we have $\Gamma(1/4) = \{0\}$ and $\Gamma(\kappa^2) = \emptyset$, for $\kappa^2 > 1/4$. In the next theorem, we
assume that for a given $\kappa^2 > 0$, $\Gamma(\kappa^2)$ is convex. This is clearly the case when the cumulant
function $b$ is such that $b''$ is quasi-concave, i.e., satisfies for any $\theta, \theta' \in \mathbb{R}$, $u \in [0, 1]$,

$$b''(u\theta + (1 - u)\theta') \ge \min[b''(\theta), b''(\theta')].$$

This assumption is satisfied for the Gaussian and Bernoulli distributions.
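For the Bernoulli cumulant function $b(\theta) = \log(1 + e^\theta)$, the level set can be computed in closed form by solving $b''(\theta) = \kappa^2$, which is a quadratic equation in $e^\theta$. The Python sketch below is one way to do this; the function names are hypothetical and the code assumes $\kappa^2 > 0$.

```python
import math

def bernoulli_b2(theta):
    """Second derivative b''(theta) of b(theta) = log(1 + e^theta)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return p * (1.0 - p)

def bernoulli_level_set(kappa2):
    """Return (lo, hi) such that {theta : b''(theta) >= kappa2} = [lo, hi],
    or None when the level set is empty (kappa2 > 1/4)."""
    if kappa2 <= 0:
        raise ValueError("kappa2 must be positive")
    if kappa2 > 0.25:
        return None
    # b''(theta) = u / (1 + u)^2 with u = e^theta; the equation
    # kappa2 * u^2 + (2 * kappa2 - 1) * u + kappa2 = 0 has two roots
    # whose product is 1, so the interval is symmetric around 0.
    root = (1 - 2 * kappa2 + math.sqrt(1 - 4 * kappa2)) / (2 * kappa2)
    hi = math.log(root)
    return (-hi, hi)

if __name__ == "__main__":
    lo, hi = bernoulli_level_set(0.1)
    print(lo, hi, bernoulli_b2(hi))
```

At the endpoints the second derivative equals $\kappa^2$, and the interval shrinks to $\{0\}$ as $\kappa^2$ increases to $1/4$.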

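Let $\mathcal{D}$ denote the class of dictionaries $\mathcal{H} = \{f_1, \dots, f_M\}$ such that $\|f_j\| \vee \|f_j\|_\infty \le 1$, $j = 1, \dots, M$.
Moreover, for any convex set $\Lambda \subset \mathbb{R}^M$, denote by $I(\Lambda)$ the interval $[-H_\infty, H_\infty]$, where

$$H_\infty = H_\infty(\Lambda) = \sup_{\mathcal{H} \in \mathcal{D}}\ \sup_{\lambda \in \Lambda}\ \sup_{x} |\mathsf{f}_\lambda(x)| \in [0, \infty]. \tag{5.15}$$

For example, we have

$$I(\Lambda) = \begin{cases}
[-1, 1] & \text{if } \Lambda = \mathcal{V} \text{ (model selection aggregation)},\\
\mathbb{R} & \text{if } \Lambda = \mathbb{R}^M \text{ (linear aggregation)},\\
[-R, R] & \text{if } \Lambda = \Lambda_1(R) \text{ (convex aggregation)}.
\end{cases}$$

To properly state the minimax lower bounds, we use the notation

$$\mathcal{E}_{\mathrm{KL}}(T_n, \Lambda, \mathcal{H}) = \mathcal{E}_{\mathrm{KL}}(T_n, \Lambda, f, \mathcal{H}),$$

that makes the dependence on the regression function $f$ explicit. Finally, we denote by $\mathbb{E}_f$ the
expectation with respect to the distribution $P_f$.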

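Theorem 5.1 Fix $M \ge 2$, $n \ge 1$, $D \ge 1$, $\kappa^2 > 0$, and assume that condition 1 holds. Moreover,
assume that $\Gamma(\kappa^2)$ is convex and that for a given set $\Lambda \subset \mathbb{R}^M$, we have $I(\Lambda) \subset \Gamma(\kappa^2)$. Then,
there exists a dictionary $\mathcal{H} \in \mathcal{D}$, with rank less than $D$, and positive constants $c_*, \delta$ such that

$$\inf_{T_n \in \Gamma(\kappa^2)}\ \sup_{\lambda \in \Lambda}\ P_{b' \circ \mathsf{f}_\lambda}\Big[\mathcal{E}_{\mathrm{KL}}(T_n, \Lambda, b' \circ \mathsf{f}_\lambda, \mathcal{H}) \ge c_* \frac{\kappa^2}{2a}\,\Delta^*_{n,M}(\Lambda)\Big] \ge \delta \tag{5.16}$$

and

$$\inf_{T_n \in \Gamma(\kappa^2)}\ \sup_{\lambda \in \Lambda}\ \mathbb{E}_{b' \circ \mathsf{f}_\lambda}\big[\mathcal{E}_{\mathrm{KL}}(T_n, \Lambda, b' \circ \mathsf{f}_\lambda, \mathcal{H})\big] \ge \delta c_* \frac{\kappa^2}{2a}\,\Delta^*_{n,M}(\Lambda), \tag{5.17}$$

where the infimum is taken over all estimators that take values in $\Gamma(\kappa^2)$ and where

$$\Delta^*_{n,M}(\Lambda) = \begin{cases}
\dfrac{D}{n} \wedge \dfrac{\log M}{n} & \text{if } \Lambda = \mathcal{V} \text{ (model selection aggregation)},\\[2mm]
\dfrac{D}{n} & \text{if } \Lambda \supset \Lambda_\infty(1) \text{ (linear aggregation)},\\[2mm]
\dfrac{D}{n} \wedge \sqrt{\dfrac{\log(1 + eM/\sqrt{n})}{n}} & \text{if } \Lambda = \Lambda_1(1) \text{ (convex aggregation)}.
\end{cases} \tag{5.18}$$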
This theorem essentially covers the Gaussian and the Bernoulli case for which condition 1
is satisfied. Lower bounds for aggregation in the Gaussian case have already been proved
in Rigollet and Tsybakov (2010, Section 6) in a weaker sense. Indeed, we enforce here that
$\mathcal{H} \in \mathcal{D}$ and has rank bounded by $D$ whereas Rigollet and Tsybakov (2010) use unbounded
dictionaries with rank that may exceed $D$ by a logarithmic multiplicative factor.

Note that the lower bound concerns only estimators of the regression function that are of
the form $b' \circ T_n$. Nevertheless, these are the only estimators that make sense since (2.2) implies
that $f$ takes values in the range of $b'$. In addition, observe that from (5.17), the least favorable
regression functions are of the form $f = b' \circ \mathsf{f}_\lambda$, $\lambda \in \Lambda$, as is usually the case in aggregation (see,
e.g., Tsybakov, 2003).

A consequence of Theorem 5.1 is that the rates of convergence obtained in Section 4, both in
expectation and with high probability, cannot be improved without further assumptions except
for the logarithmic term of convex aggregation.

The proof of Theorem 5.1 is postponed to subsection 7.4.

6 Examples 

6.1 Examples of exponential families 

This subsection is a reminder of the versatility of exponential families of distributions and its
goal is to illustrate conditions 1 and 2 on some examples. Most of the material can be found for
example in McCullagh and Nelder (1989). The form of the density described in (2.1) is usually
referred to as natural form. More generally, an exponential family of univariate distributions is
defined as a family of distributions with density



where $T(Y)$ is a given sufficient statistic. Here, only the case where $T(\cdot)$ is the identity function is
studied but we now recall that it already encompasses many different distributions. Table 1 gives
examples of distributions that have such a density. For distributions with several parameters, it
is assumed that all parameters but $\theta$ are known. For the Normal and Gamma distributions,
the reference measure is the Lebesgue measure whereas for the Bernoulli, Negative Binomial
and Poisson distributions, the reference measure is the counting measure on $\mathbb{Z}$. For all these
distributions, the cumulant function $b(\cdot)$ is twice continuously differentiable.

Observe first that only the Normal and Bernoulli distributions satisfy condition 1. Indeed, 
all other distributions in the table do not have sub-Gaussian tails and therefore, we cannot 
use Lemma 7.1 to control the deviations and moments of the sum of independent random 
variables. Therefore, only Theorem 4.3 applies to the remaining distributions even though 
direct computation of the moments can yield results of the same type as Theorems 4.5-4.6 but 
with bounds that are larger by orders of magnitude. 

Another important message of Table 1 is that the constant $\kappa^2$ can depend on the constant
$H_\infty$ defined in (5.15). Consequently the $L_2$ distance $\|\mathsf{f}_{\hat\lambda_n} - \mathsf{f}_{\lambda^*}\|^2$ is affected by the constant $\kappa^2$
and thus by $H_\infty$. However, the constant $B^2$ does not depend on $H_\infty$. Therefore, the bounds
on the excess-KL presented in Theorems 4.5 and 4.6 hold without extra assumption on the
(-00,0)


1 

e 


a 


Negative 
Binomial 


(0,oo) 


r 

l-e« 


1 


Poisson 


IR 




1 



m 

2 

log(l + e^) 
-log(-0) 





B2 
1


1 
4


(l+e«oc)^ 




OO 


1 


re" 


oo 




(l-e»)2 






OO 





Table 1: Exponential families of distributions and constants in conditions 1 and 2 where $H_\infty$ is
defined in (5.15). (Source: McCullagh and Nelder, 1989)



dictionary. For the Normal distribution, $B^2 = \kappa^2 = 1$ regardless of the value of $H_\infty$, which makes
it a particular case.

6.2 Bounds for logistic regression with a large dictionary 

Let us now focus on the Bernoulli distribution. Recall that in the setup of binary classification
(see, e.g., Boucheron et al., 2005, for a survey on this topic), we observe a collection of
independent random couples $(x_1, Y_1), \dots, (x_n, Y_n)$ such that $Y_i \in \{0, 1\}$ has Bernoulli distribution
with parameter $f(x_i)$, $i = 1, \dots, n$. As shown in the survey by Boucheron et al. (2005),
there exists a tremendous amount of work on this topic and we will focus on the so-called boosting
type algorithms. A dictionary of base classifiers $\mathcal{H} = \{f_1, \dots, f_M\}$, i.e., functions taking
values in $[-1, 1]$, is given and training a boosting algorithm consists in combining them in such
a way that $\mathsf{f}_\lambda(x_i)$ predicts $f(x_i)$ well.

This part of the paper is mostly inspired by Friedman et al. (2000) who propose a statistical
view of boosting. Specifically, they offer an interpretation of the original AdaBoost algorithm
introduced in Freund and Schapire (1996) as a sequential optimization procedure that fits an
extended additive model for a particular choice of the loss function. Then they propose to
directly maximize the Bernoulli log-likelihood using quasi-Newton optimization and derive a new
algorithm called LogitBoost. Even though we do not detail how maximization of the likelihood
is performed, LogitBoost aims at solving the same problem as the one studied here. One
difference here is that while extended additive models assume that there exists $\lambda \in \Lambda \subset \mathbb{R}^M$ such
that the regression function satisfies $(b')^{-1} \circ f = \mathsf{f}_\lambda$, maximum likelihood aggregation does
not. The paper of Friedman et al. (2000) focuses on the optimization side of the problem and
does not contain finite sample results. A recent attempt to compensate for a lack of statistical
analysis can be found in Mease and Wyner (2008) and the many discussions that it produced.
We propose to contribute to this discussion by illustrating some statistical aspects of LogitBoost
based on the rates derived in Section 4 and in particular, how its performance depends on the
size of the dictionary.

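Given a convex subset $\Lambda \subset \mathbb{R}^M$ and a convex function $\varphi : \mathbb{R} \to \mathbb{R}$, training a boosting
algorithm, and more generally a large margin classifier, consists in minimizing the risk function
defined by

$$R_\varphi(\mathsf{f}_\lambda) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\varphi(-\tilde Y_i\,\mathsf{f}_\lambda(x_i))\big]$$

over $\lambda \in \Lambda$, where $\tilde Y_i = 2Y_i - 1 \in \{-1, 1\}$. It is usually required that $\varphi$ be monotonically
nondecreasing on $\mathbb{R}$ and satisfy $\varphi(0) = 1$. Typical choices for $\varphi$ are $\exp(\cdot)$ or the hinge
loss $\max(1 + \cdot, 0)$. It is not hard to show that minimizing the Kullback-Leibler divergence
$\mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda})$ is equivalent to choosing

$$\varphi(x) = \frac{\log(1 + e^x)}{\log 2}, \tag{6.19}$$

up to the normalizing constant $\log 2$ that appears to ensure that $\varphi(0) = 1$. For the choice of $\varphi$
defined in (6.19), we have

$$R_\varphi(\mathsf{f}_\lambda) - \min_{\lambda' \in \Lambda} R_\varphi(\mathsf{f}_{\lambda'}) = \frac{1}{\log 2}\,\mathcal{E}_{\mathrm{KL}}(\mathsf{f}_\lambda, \Lambda, \mathcal{H}).$$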
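The link between the logistic choice of $\varphi$ and the Bernoulli likelihood can be checked numerically. The sketch below assumes the sign convention $\varphi(-\tilde Y v)$ with $\tilde Y = 2Y - 1$ and the Bernoulli cumulant function $b(v) = \log(1 + e^v)$; the function names are hypothetical. It verifies that $\log 2 \cdot \varphi(-\tilde y v)$ equals the negative log-likelihood term $b(v) - yv$ for $y \in \{0, 1\}$.

```python
import math

def phi(x):
    """Normalized logistic loss phi(x) = log(1 + e^x) / log 2."""
    return math.log1p(math.exp(x)) / math.log(2)

def bernoulli_neg_loglik(y, v):
    """Negative Bernoulli log-likelihood b(v) - y * v in natural form,
    with b(v) = log(1 + e^v) and y in {0, 1}."""
    return math.log1p(math.exp(v)) - y * v

if __name__ == "__main__":
    for y in (0, 1):
        y_tilde = 2 * y - 1  # relabel {0, 1} as {-1, +1}
        for v in (-2.0, -0.5, 0.0, 1.3):
            lhs = math.log(2) * phi(-y_tilde * v)
            rhs = bernoulli_neg_loglik(y, v)
            print(y, v, lhs, rhs)
```

Both columns printed by the loop coincide up to rounding, which is the per-observation identity behind the equivalence stated above.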

In boosting algorithms, the size of the dictionary $M$ is much larger than the sample size $n$ so
that the results of Theorems 4.3-4.4 are useless and it is necessary to constrain $\lambda$ to be in a ball
$\Lambda_1(R)$ so that $H_\infty = R$. Given that for the Bernoulli distribution, we have $a = 1$, $B^2 = 1/4$, the
constants in the main theorems can be explicitly computed and in fact, they remain low. We can
therefore apply Theorems 4.5-4.6 to obtain the following corollary that gives oracle inequalities
for the $\varphi$-risk $R_\varphi$, both in expectation and with high probability. We focus on the case where
$M$ is (much) larger than $\sqrt{n}$ as is usually the case in boosting.

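Corollary 6.1 Consider the boosting problem with a given dictionary of base classifiers and let
$\varphi$ be the convex function defined in (6.19). Then, the maximum likelihood aggregate $\mathsf{f}_{\hat\lambda_n}$ over
$\Lambda_1(R)$ defined in (4.7) satisfies

$$\mathbb{E}\big[R_\varphi(\mathsf{f}_{\hat\lambda_n})\big] - \min_{\lambda \in \Lambda_1(R)} R_\varphi(\mathsf{f}_\lambda) \le \frac{2eR\sqrt{\pi}}{\log 2}\sqrt{\frac{\log M}{n}}.$$

Moreover, for any $\delta > 0$, with probability $1 - \delta$, it holds

$$R_\varphi(\mathsf{f}_{\hat\lambda_n}) - \min_{\lambda \in \Lambda_1(R)} R_\varphi(\mathsf{f}_\lambda) \le \frac{4R\sqrt{\pi e}}{\log 2}\sqrt{\frac{\log M}{n}}\,\sqrt{\log(2/\delta)}.$$

One striking feature of the bounds in this corollary is the simplicity of the constants. Similar
results can be obtained using standard techniques to control suprema of empirical processes as
in Massart (2007) and Koltchinskii (2008) for example, but such general techniques are bound
to yield larger constants.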
7 Proof of the main results

In this section, we prove the main theorems. We begin by recalling some properties of exponential
families of distributions. While similar results can be found in the literature, the results presented
below are tailored to our needs. In particular, the constants in the upper bounds are explicit
and kept as small as possible. In this section, for any $\omega \in \ell_2(\mathbb{R})$, denote by $|\omega|_2$ its $\ell_2$-norm
defined by

$$|\omega|_2^2 = \sum_i \omega_i^2.$$

7.1 Some useful results on canonical exponential families 

Let $Y \in \mathbb{R}$ be a random variable with distribution in a canonical exponential family that admits
a density with respect to a reference measure on $\mathbb{R}$ given by

$$p(y; \theta) = \exp\Big\{\frac{y\theta - b(\theta)}{a} + c(y)\Big\}, \qquad \theta \in \mathbb{R}. \tag{7.1}$$

The cumulant function $b(\cdot)$ not only contains information about the first two moments but all
the moments of $Y$ through the moment generating function (MGF). Indeed, it can be easily
shown (see, e.g., Lehmann and Casella, 1998, Theorem 5.10) that the MGF of $Y$ is given by

$$\mathbb{E}[e^{tY}] = \exp\Big\{\frac{b(\theta + at) - b(\theta)}{a}\Big\}. \tag{7.2}$$

Using the MGF we can derive the Chernoff-type bounds presented in the following lemma.

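Lemma 7.1 Let $\omega = (\omega_1, \dots, \omega_n) \in \mathbb{R}^n$ be a vector of deterministic weights. Let $Y_1, \dots, Y_n$
be independent random variables such that $Y_i$ has density $p(\cdot; \theta_i)$ defined in (7.1), $\theta_i \in \mathbb{R}$,
$i = 1, \dots, n$ and define the weighted sum $S_n^\omega = \sum_{i=1}^n \omega_i Y_i$. Assume that the second derivative of
$b$ is uniformly bounded:

$$\sup_{\theta \in \mathbb{R}} b''(\theta) \le B^2. \tag{7.3}$$

Then the following inequalities hold,

$$\mathbb{E}\big[\exp\big(s|S_n^\omega - \mathbb{E}(S_n^\omega)|\big)\big] \le \exp\Big(\frac{s^2 a B^2 |\omega|_2^2}{2}\Big), \tag{7.4}$$

$$\mathbb{P}\big[|S_n^\omega - \mathbb{E}(S_n^\omega)| > t\big] \le 2\exp\Big(-\frac{t^2}{2 a B^2 |\omega|_2^2}\Big), \tag{7.5}$$

and for any $r > 0$, we have

$$\mathbb{E}\,|S_n^\omega - \mathbb{E}(S_n^\omega)|^r \le C_r\,|\omega|_2^r, \tag{7.6}$$

where $C_r = r(2aB^2)^{r/2}\,\Gamma(r/2)$ and $\Gamma(\cdot)$ denotes the Gamma function. Moreover, for any $r \ge 1$,
the following simpler bound holds

$$\big(\mathbb{E}\,|S_n^\omega - \mathbb{E}(S_n^\omega)|^r\big)^{1/r} \le 2B\sqrt{\pi a r}\,|\omega|_2. \tag{7.7}$$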
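Proof. Using respectively (7.2), (2.2) and (7.3), we get

$$\mathbb{E}\big[\exp\big(s(S_n^\omega - \mathbb{E}(S_n^\omega))\big)\big] = \exp\Big(\frac{1}{a}\sum_{i=1}^n \big[b(\theta_i + as\omega_i) - b(\theta_i) - as\omega_i b'(\theta_i)\big]\Big) \le \exp\Big(\frac{s^2 a B^2 |\omega|_2^2}{2}\Big).$$

The same inequality holds with $s$ replaced by $-s$ so (7.4) holds.

We now turn to the proof of (7.5). From the Markov inequality, for any $s > 0$, we have

$$\mathbb{P}\big[S_n^\omega - \mathbb{E}(S_n^\omega) > t\big] \le e^{-st}\,\mathbb{E}\big[\exp\big(s(S_n^\omega - \mathbb{E}(S_n^\omega))\big)\big].$$

Together with (7.4), this inequality yields

$$\mathbb{P}\big[S_n^\omega - \mathbb{E}(S_n^\omega) > t\big] \le \inf_{s > 0} \exp\Big(\frac{s^2 a B^2 |\omega|_2^2}{2} - st\Big) = \exp\Big(-\frac{t^2}{2 a B^2 |\omega|_2^2}\Big).$$

The same reasoning using $s < 0$ instead of $s > 0$ yields (7.5).
Finally, observe that

$$\mathbb{E}\,|S_n^\omega - \mathbb{E}(S_n^\omega)|^r = \int_0^\infty \mathbb{P}\big(|S_n^\omega - \mathbb{E}(S_n^\omega)| > t^{1/r}\big)\,dt \le 2\int_0^\infty \exp\Big(-\frac{t^{2/r}}{2 a B^2 |\omega|_2^2}\Big)\,dt,$$

where we used (7.5) in the last inequality. Using a change of variable, it is not hard to see
that this bound yields (7.6). To prove (7.7), we use the following upper bound on the Gamma
function based on Stirling's approximation

$$\Gamma(z) \le \sqrt{2\pi z}\,\Big(\frac{z}{e}\Big)^z, \qquad z \ge 1.$$

It yields for any $r \ge 1$

$$C_r^{1/r} = r^{1/r} B\sqrt{2a}\,[\Gamma(r/2)]^{1/r} \le B\sqrt{\pi a r}\,e^{1/e - 1/2} \le 2B\sqrt{\pi a r},$$

where, in the first inequality, we used the fact that for any $r > 0$, $r^{1/r} \le e^{1/e}$.

□

For the Gaussian distribution $\mathcal{N}(\theta, \sigma^2)$, recall that $a = \sigma^2$ and $B^2 = 1$ and (7.5) yields
the usual tail bound for the sum of independent Gaussian random variables. For the Bernoulli
distribution, we have $a = 1$ and $B^2 = 1/4$, which yields Hoeffding's inequality (see, e.g., Massart,
2007, Proposition 2.7).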
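The Bernoulli case can be checked directly. The Python sketch below assumes that (7.5) reads $2\exp(-t^2/(2aB^2|\omega|_2^2))$, takes equal weights $\omega_i = 1/n$ and compares this bound with the exact tail of the average of $n$ independent Bernoulli variables with a common parameter $p$ (a simplifying assumption made only for this illustration). The function names are hypothetical.

```python
import math

def tail_bound(t, a, B2, weights):
    """Assumed right-hand side of (7.5): 2 exp(-t^2 / (2 a B^2 |w|_2^2))."""
    norm2 = sum(w * w for w in weights)
    return 2.0 * math.exp(-t * t / (2.0 * a * B2 * norm2))

def exact_bernoulli_tail(t, n, p):
    """Exact P(|S - E S| > t) where S is the average of n i.i.d.
    Bernoulli(p) variables, computed from the binomial distribution."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - p) > t:
            total += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

if __name__ == "__main__":
    n, p, t = 50, 0.3, 0.1
    weights = [1.0 / n] * n
    bound = tail_bound(t, a=1.0, B2=0.25, weights=weights)
    print("bound (7.5):", bound)           # equals Hoeffding's 2 exp(-2 n t^2)
    print("exact tail :", exact_bernoulli_tail(t, n, p))
```

With $a = 1$, $B^2 = 1/4$ and $|\omega|_2^2 = 1/n$, the bound reduces to $2\exp(-2nt^2)$, which is Hoeffding's inequality for variables taking values in $[0, 1]$.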

7.2 Proof of Theorems 4.1 and 4.2 

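According to (4.1), minimizing $\lambda \mapsto \mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda})$ is equivalent to maximizing $\lambda \mapsto L(\lambda)$ where

$$L(\lambda) = \langle f, \mathsf{f}_\lambda \rangle - \langle b \circ \mathsf{f}_\lambda, \mathbf{1} \rangle. \tag{7.8}$$

Note that for any $\Lambda \subset \mathbb{R}^M$, the set of optimal solutions $\Lambda^*$ satisfies

$$\Lambda^* = \operatorname{argmin}_{\lambda \in \Lambda} \mathcal{K}(P_f \| P_{b' \circ \mathsf{f}_\lambda}) = \operatorname{argmax}_{\lambda \in \Lambda} L(\lambda).$$

Moreover, for any $\lambda \in \Lambda$, $\lambda^* \in \Lambda^*$, we have

$$L(\lambda^*) - L(\lambda) = a\,\mathcal{E}_{\mathrm{KL}}(\mathsf{f}_\lambda, \Lambda, \mathcal{H}). \tag{7.9}$$

For any fixed $\lambda \in \Lambda_1^+(1)$, define the following quantities:

$$S_n(\lambda) = \sum_{i=1}^n \Big[2Y_i\,\mathsf{f}_\lambda(x_i) - b \circ \mathsf{f}_\lambda(x_i) - \sum_{j=1}^M \lambda_j\, b \circ f_j(x_i)\Big],$$

$$S(\lambda) = n\sum_{j=1}^M \lambda_j L(e_j) + nL(\lambda) = \sum_{i=1}^n \Big[2f(x_i)\,\mathsf{f}_\lambda(x_i) - b \circ \mathsf{f}_\lambda(x_i) - \sum_{j=1}^M \lambda_j\, b \circ f_j(x_i)\Big],$$

and observe that $S(\lambda) = \mathbb{E}[S_n(\lambda)]$ and that for any $\lambda \in \Lambda_1^+(1)$,

$$S_n(\lambda) - S(\lambda) = 2\sum_{i=1}^n (Y_i - f(x_i))\,\mathsf{f}_\lambda(x_i).$$

By definition of $\hat\lambda$, we have for any $\lambda \in \Lambda_1^+(1)$ that

$$S(\hat\lambda) \ge S(\lambda) - \Delta_n(\lambda) + \beta H(\lambda) - \beta \log M, \tag{7.10}$$

where

$$\Delta_n(\lambda) = 2\sum_{i=1}^n (Y_i - f(x_i))\,\mathsf{f}_{\hat\lambda - \lambda}(x_i) + \beta H(\hat\lambda) - \beta \log M.$$

The following lemma is useful to control the term $\Delta_n(\lambda)$ both in expectation and with high
probability.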
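Lemma 7.2 Under condition 1, for any $\lambda \in \Lambda_1^+(1)$ we have

$$\mathbb{E}\Big[\exp\Big(\frac{\Delta_n(\lambda)}{\beta} - \frac{2B^2 a n}{\beta^2}\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_\lambda\|^2\Big)\Big] \le 1.$$

Proof. Using Jensen's inequality, we get

$$\begin{aligned}
&\mathbb{E}\Big[\exp\Big(\frac{\Delta_n(\lambda)}{\beta} - \frac{2B^2 a n}{\beta^2}\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_\lambda\|^2\Big)\Big]\\
&\quad = \mathbb{E}\Big[\exp\Big(\frac{2}{\beta}\sum_{i=1}^n (Y_i - f(x_i))\,\mathsf{f}_{\hat\lambda - \lambda}(x_i) - \sum_{j=1}^M \hat\lambda_j \log(M\hat\lambda_j) - \frac{2B^2 a n}{\beta^2}\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_\lambda\|^2\Big)\Big]\\
&\quad \le \mathbb{E}\Big[\sum_{j=1}^M \hat\lambda_j \exp\Big(\frac{2}{\beta}\sum_{i=1}^n (Y_i - f(x_i))(f_j(x_i) - \mathsf{f}_\lambda(x_i)) - \log(M\hat\lambda_j) - \frac{2B^2 a n}{\beta^2}\|f_j - \mathsf{f}_\lambda\|^2\Big)\Big]\\
&\quad = \frac{1}{M}\sum_{j=1}^M \mathbb{E}\Big[\exp\Big(\frac{2}{\beta}\sum_{i=1}^n (Y_i - f(x_i))(f_j(x_i) - \mathsf{f}_\lambda(x_i)) - \frac{2B^2 a n}{\beta^2}\|f_j - \mathsf{f}_\lambda\|^2\Big)\Big].
\end{aligned}$$

Now, from (7.4), which holds under condition 1, we have for any $\beta > 0$ and any $\lambda, \lambda' \in \Lambda_1^+(1)$
that

$$\mathbb{E}\Big[\exp\Big(\frac{2}{\beta}\sum_{i=1}^n (Y_i - f(x_i))(\mathsf{f}_\lambda(x_i) - \mathsf{f}_{\lambda'}(x_i))\Big)\Big] \le \exp\Big(\frac{2B^2 a n}{\beta^2}\|\mathsf{f}_\lambda - \mathsf{f}_{\lambda'}\|^2\Big),$$

and the result of the lemma follows from the previous two displays. □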



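Take any $\bar\lambda \in \operatorname{argmax}_{\lambda \in \Lambda_1^+(1)} S(\lambda)$ and observe that condition 2 together with a second order
Taylor expansion of the function $S(\cdot)$ around $\bar\lambda$ gives for any $\lambda \in \Lambda_1^+(1)$

$$S(\lambda) \le S(\bar\lambda) + [\nabla_\lambda S(\bar\lambda)]^\top (\lambda - \bar\lambda) - \frac{n\kappa^2}{2}\|\mathsf{f}_\lambda - \mathsf{f}_{\bar\lambda}\|^2,$$

where $\nabla_\lambda S(\bar\lambda)$ denotes the gradient of $\lambda \mapsto S(\lambda)$ at $\bar\lambda$. Since $\bar\lambda$ is a maximizer of $\lambda \mapsto S(\lambda)$ over
the set $\Lambda_1^+(1)$ to which $\hat\lambda$ also belongs, we find that $\nabla_\lambda S(\bar\lambda)^\top(\hat\lambda - \bar\lambda) \le 0$ so that

$$S(\bar\lambda) - S(\hat\lambda) \ge \frac{n\kappa^2}{2}\|\mathsf{f}_{\hat\lambda} - \mathsf{f}_{\bar\lambda}\|^2.$$

Together with (7.10), the previous display yields

$$\frac{n\kappa^2}{2}\|\mathsf{f}_{\hat\lambda} - \mathsf{f}_{\bar\lambda}\|^2 \le S(\bar\lambda) - S(\hat\lambda) \le \Delta_n(\bar\lambda) + \beta \log M,$$

where we used the fact that $H(\bar\lambda) \ge 0$.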

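Proof of Theorem 4.1.

Using the convexity inequality $t \le e^t - 1$ for any $t \in \mathbb{R}$, Lemma 7.2 yields

$$\mathbb{E}[\Delta_n(\lambda)] \le \frac{2B^2 a n}{\beta}\,\mathbb{E}\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_\lambda\|^2. \tag{7.11}$$

The previous display combined with (7.11) gives

$$S(\bar\lambda) - \mathbb{E}\big[S(\hat\lambda)\big] \le \frac{2B^2 a n}{\beta}\,\mathbb{E}\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_{\bar\lambda}\|^2 + \beta \log M.$$

It implies that for $\beta \ge 8B^2 a/\kappa^2$

$$S(\bar\lambda) - \mathbb{E}\big[S(\hat\lambda)\big] \le \frac{4B^2 a n}{\beta}\,\mathbb{E}\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_{\hat\lambda}\|^2 + 2\beta \log M. \tag{7.12}$$

Observe now that a second order Taylor expansion of the function $L(\cdot)$ around $\hat\lambda$, together with
condition 2, gives for any $\lambda \in \Lambda_1^+(1)$

$$L(\lambda) \le L(\hat\lambda) + [\nabla_\lambda L(\hat\lambda)]^\top(\lambda - \hat\lambda) - \frac{\kappa^2}{2}\|\mathsf{f}_\lambda - \mathsf{f}_{\hat\lambda}\|^2.$$

Thus

$$\sum_{j=1}^M \hat\lambda_j L(e_j) \le L(\hat\lambda) - \frac{\kappa^2}{2}\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_{\hat\lambda}\|^2.$$

It follows that

$$S(\hat\lambda) = n\sum_{j=1}^M \hat\lambda_j L(e_j) + nL(\hat\lambda) \le 2nL(\hat\lambda) - \frac{n\kappa^2}{2}\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_{\hat\lambda}\|^2.$$

Combined with (7.12), the above inequality yields

$$S(\bar\lambda) - 2n\,\mathbb{E}\big[L(\hat\lambda)\big] \le \Big(\frac{4B^2 a n}{\beta} - \frac{n\kappa^2}{2}\Big)\,\mathbb{E}\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_{\hat\lambda}\|^2 + 2\beta \log M \le 2\beta \log M,$$

for $\beta \ge 8B^2 a/\kappa^2$.

Note that for any $j = 1, \dots, M$, $S(\bar\lambda) \ge S(e_j) = 2nL(e_j)$ so that from (7.9), we get

$$a\,\mathbb{E}\big[\mathcal{E}_{\mathrm{KL}}(\mathsf{f}_{\hat\lambda}, \mathcal{V}, \mathcal{H})\big] = \max_{1 \le j \le M} L(e_j) - \mathbb{E}\big[L(\hat\lambda)\big] \le \frac{\beta}{n}\log M.$$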

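Proof of Theorem 4.2. From the Markov inequality and Lemma 7.2 we get for any $\lambda \in \Lambda_1^+(1)$
and any $\delta > 0$ that

$$\mathbb{P}\Big[\Delta_n(\lambda) - \frac{2B^2 a n}{\beta}\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_\lambda\|^2 > \beta \log(1/\delta)\Big] \le \delta.$$

Thus, the event $\mathcal{A}_\lambda(\delta)$ on which

$$\Delta_n(\lambda) \le \frac{2B^2 a n}{\beta}\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_\lambda\|^2 + \beta \log(1/\delta)$$

has probability greater than $1 - \delta$. Theorem 4.2 follows by applying the same steps as in the
proof of Theorem 4.1 but on the event $\mathcal{A}_{\bar\lambda}(\delta)$ instead of in expectation.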
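7.3 Proof of Theorems 4.3-4.6

The following lemma exploits the strong convexity property stated in condition 2.

Lemma 7.3 Let $(\phi_1, \dots, \phi_D)$ be an orthonormal basis of the linear span of the dictionary $\mathcal{H}$. Let
$\Lambda$ be a closed convex subset of $\mathbb{R}^M$ or $\mathbb{R}^M$ itself and assume that $(\mathcal{H}, \Lambda)$ satisfies condition 2.
Denote by $\lambda^*$ any maximizer of the function $\lambda \mapsto L(\lambda)$ over the set $\Lambda$. Then any maximum
likelihood estimator $\hat\lambda_n$ satisfies

$$\frac{\kappa^2}{2}\|\mathsf{f}_{\hat\lambda_n} - \mathsf{f}_{\lambda^*}\|^2 \le L(\lambda^*) - L(\hat\lambda_n) \le \frac{2}{\kappa^2}\sum_{j=1}^D \zeta_j^2, \tag{7.13}$$

where $\zeta_j = \frac{1}{n}\sum_{i=1}^n Y_i\phi_j(x_i) - \langle f, \phi_j \rangle$, $j = 1, \dots, D$. Moreover, if $\Lambda \subset \Lambda_1(R)$, $R > 0$, is a closed
convex set, then $\hat\lambda_n$ satisfies

$$\frac{\kappa^2}{2}\|\mathsf{f}_{\hat\lambda_n} - \mathsf{f}_{\lambda^*}\|^2 \le L(\lambda^*) - L(\hat\lambda_n) \le 2R\max_{1 \le j \le M}|\xi_j|, \tag{7.14}$$

where $\xi_j = \frac{1}{n}\sum_{i=1}^n Y_i f_j(x_i) - \langle f, f_j \rangle$, $j = 1, \dots, M$.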

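Proof. A second order Taylor expansion of the function $L(\cdot)$ around $\lambda^*$ gives for any $\lambda \in \Lambda$

$$L(\lambda) \le L(\lambda^*) + [\nabla_\lambda L(\lambda^*)]^\top(\lambda - \lambda^*) - \frac{\kappa^2}{2}\|\mathsf{f}_\lambda - \mathsf{f}_{\lambda^*}\|^2,$$

where we used condition 2 and where $\nabla_\lambda L(\lambda^*)$ denotes the gradient of $\lambda \mapsto L(\lambda)$ at $\lambda^*$. Since $\lambda^*$ is
a maximizer of $\lambda \mapsto L(\lambda)$ over the set $\Lambda$ to which $\lambda$ also belongs, we find that $\nabla_\lambda L(\lambda^*)^\top(\lambda - \lambda^*) \le 0$
so that

$$L(\lambda^*) - L(\lambda) \ge \frac{\kappa^2}{2}\|\mathsf{f}_\lambda - \mathsf{f}_{\lambda^*}\|^2, \tag{7.15}$$

for any $\lambda \in \Lambda$, which gives the left inequalities in (7.13) and (7.14).
Next, from the definition of $\hat\lambda_n$, we have

$$L(\hat\lambda_n) \ge L(\lambda^*) + T_n(\lambda^* - \hat\lambda_n), \tag{7.16}$$

where

$$T_n(\mu) = \frac{1}{n}\sum_{i=1}^n (Y_i - f(x_i))\,\mathsf{f}_\mu(x_i).$$

Writing $\mathsf{f}_\mu = \sum_{j=1}^D \nu_j \phi_j$, we find that

$$T_n(\mu) = \sum_{j=1}^D \nu_j\Big[\frac{1}{n}\sum_{i=1}^n Y_i\phi_j(x_i) - \langle f, \phi_j \rangle\Big] = \sum_{j=1}^D \nu_j\zeta_j.$$

Define the random variable

$$V_n = \sup_{\mu\,:\,\|\mathsf{f}_\mu\| > 0} \frac{|T_n(\mu)|}{\|\mathsf{f}_\mu\|},$$

so that $V_n$ satisfies

$$V_n^2 \le \sum_{j=1}^D \zeta_j^2.$$

Since $T_n(\lambda^* - \hat\lambda_n) \ge -V_n\|\mathsf{f}_{\lambda^* - \hat\lambda_n}\|$, it yields together with (7.16) that

$$L(\hat\lambda_n) \ge L(\lambda^*) - \|\mathsf{f}_{\hat\lambda_n} - \mathsf{f}_{\lambda^*}\|\Big(\sum_{j=1}^D \zeta_j^2\Big)^{1/2}. \tag{7.17}$$

Combining (7.17) and (7.15) with $\lambda = \hat\lambda_n$, we get (7.13).

We now turn to the proof of (7.14). From (7.16), and the Hölder inequality, we have

$$L(\lambda^*) - L(\hat\lambda_n) \le \Big(\sum_{j=1}^M |\hat\lambda_{n,j} - \lambda^*_j|\Big)\max_{1 \le j \le M}|\xi_j| \le 2R\max_{1 \le j \le M}|\xi_j|.$$

Combined with (7.15), this inequality yields (7.14). □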

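In view of (7.9), to complete the proof of Theorems 4.3-4.6, it is sufficient to bound from
above the quantities appearing on the right hand side of (7.13) and (7.14). This is done using
results from subsection 7.1 and by observing that the random variables $\zeta_j$ and $\xi_j$ are of the form

$$\zeta_j = S_n^{\omega^{(j)}} - \mathbb{E}\big(S_n^{\omega^{(j)}}\big), \qquad \omega_i^{(j)} = \frac{\phi_j(x_i)}{n}, \qquad |\omega^{(j)}|_2 = \frac{1}{\sqrt{n}}, \tag{7.18}$$

and

$$\xi_j = S_n^{\omega^{(j)}} - \mathbb{E}\big(S_n^{\omega^{(j)}}\big), \qquad \omega_i^{(j)} = \frac{f_j(x_i)}{n}, \qquad |\omega^{(j)}|_2 \le \frac{1}{\sqrt{n}}, \tag{7.19}$$

where the last inequality is obtained under the assumption that $\max_{1 \le j \le M}\|f_j\| \le 1$.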

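Proof of Theorem 4.3. Since the random variables $Y_i$, $i = 1, \dots, n$ are mutually independent,
we have

$$\mathbb{E}\sum_{j=1}^D \zeta_j^2 = \sum_{j=1}^D\sum_{i=1}^n \big(\omega_i^{(j)}\big)^2\,\mathrm{Var}(Y_i) \le aB^2\sum_{j=1}^D |\omega^{(j)}|_2^2 = \frac{aB^2 D}{n}.$$

Together with (7.9) and (7.13), this bound completes the proof of Theorem 4.3.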
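Proof of Theorem 4.4. For any $s, t > 0$, we have

$$\begin{aligned}
\mathbb{P}\Big[\sum_{j=1}^D \zeta_j^2 > t\Big] = \mathbb{P}\Big[\frac{1}{D}\sum_{j=1}^D \zeta_j^2 > \frac{t}{D}\Big]
&\le e^{-st/D}\,\mathbb{E}\Big[\exp\Big(\frac{s}{D}\sum_{j=1}^D \zeta_j^2\Big)\Big] && \text{(Markov's inequality)}\\
&\le e^{-st/D}\,\frac{1}{D}\sum_{j=1}^D \mathbb{E}\big[\exp(s\zeta_j^2)\big] && \text{(Jensen's inequality)}\\
&\le e^{-st/D}\,\frac{1}{D}\sum_{j=1}^D\sum_{p=0}^\infty \frac{s^p}{p!}\,\mathbb{E}\big[\zeta_j^{2p}\big] && \text{(Fatou's lemma)}.
\end{aligned}$$

But (7.6), which holds under condition 1, and (7.18) yield

$$\mathbb{E}\big[\zeta_j^{2p}\big] \le 2\,p!\,\Big(\frac{2aB^2}{n}\Big)^p.$$

Therefore, the last two displays yield

$$\mathbb{P}\Big[\sum_{j=1}^D \zeta_j^2 > t\Big] \le 2e^{-st/D}\sum_{p=0}^\infty \Big(\frac{2saB^2}{n}\Big)^p.$$

Finally, taking $s = n/(4aB^2)$ yields

$$\mathbb{P}\Big[\sum_{j=1}^D \zeta_j^2 > t\Big] \le 4\exp\Big(-\frac{nt}{4aB^2 D}\Big).$$

Theorem 4.4 follows by taking $t = \frac{4aB^2 D}{n}\log(4/\delta)$ in the previous display together with (7.9)
and (7.13).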

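Proof of Theorem 4.5. Using successively Jensen's inequality and (7.7), which holds under
condition 1, we find that for any $r \ge 1$, it holds

$$\mathbb{E}\Big[\max_{1 \le j \le M}|\xi_j|\Big] \le \Big(\mathbb{E}\sum_{j=1}^M |\xi_j|^r\Big)^{1/r} \le M^{1/r}\max_{1 \le j \le M}\big(\mathbb{E}|\xi_j|^r\big)^{1/r} \le 2BM^{1/r}\sqrt{\frac{\pi a r}{n}},$$

where we used (7.7) and (7.19) in the last inequality. Choosing now $r = \log M$ yields

$$\mathbb{E}\Big[\max_{1 \le j \le M}|\xi_j|\Big] \le 2eB\sqrt{\frac{\pi a \log M}{n}}.$$

Combined with (7.9) and (7.14) the previous inequality completes the proof of Theorem 4.5.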
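Proof of Theorem 4.6. For any $r \ge 1$, $s, t > 0$, we have

$$\begin{aligned}
\mathbb{P}\Big[\max_{1 \le j \le M}|\xi_j| > t\Big] \le \mathbb{P}\Big[\Big(\sum_{j=1}^M |\xi_j|^r\Big)^{2/r} > t^2\Big]
&\le e^{-st^2}\,\mathbb{E}\Big[\exp\Big(s\Big(\sum_{j=1}^M |\xi_j|^r\Big)^{2/r}\Big)\Big]\\
&\le e^{-st^2}\sum_{p=0}^\infty \frac{s^p}{p!}\,\mathbb{E}\Big[\Big(\sum_{j=1}^M |\xi_j|^r\Big)^{2p/r}\Big],
\end{aligned} \tag{7.20}$$

where we used the Markov inequality and a Taylor expansion coupled with Fatou's lemma
respectively. We now control the term $\mathbb{E}\big[\big(\sum_{j=1}^M |\xi_j|^r\big)^{2p/r}\big]$.

Assume first that $p \le r/2$. Then, using respectively Jensen's inequality, (7.19) and (7.7), we
get

$$\mathbb{E}\Big[\Big(\sum_{j=1}^M |\xi_j|^r\Big)^{2p/r}\Big] \le \Big(\sum_{j=1}^M \mathbb{E}|\xi_j|^r\Big)^{2p/r} \le \Big(2BM^{1/r}\sqrt{\pi a r}\max_{1 \le j \le M}|\omega^{(j)}|_2\Big)^{2p} \le \Big(2BM^{1/r}\sqrt{\frac{\pi a r}{n}}\Big)^{2p}.$$

Next, if $p > r/2$, Jensen's inequality yields

$$\mathbb{E}\Big[\Big(\sum_{j=1}^M |\xi_j|^r\Big)^{2p/r}\Big] \le M^{2p/r - 1}\sum_{j=1}^M \mathbb{E}|\xi_j|^{2p} \le 2\,p!\,M^{2p/r}\Big(\frac{2aB^2}{n}\Big)^p,$$

where in the last inequality, we used (7.6) and (7.19).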
Recalling that r > 1, the last two displays yield 



oo p ^ 



p=0 ^ j=l 



2p/r 



< y - 

^-^ p! \ n 

0<p<r/2 

47rsai?^r \P 



1 fA'KsaB'^r\P /2saB'^\P 

+ 2 E 



p>r/2 



n 



sE 

p=0 



n 



(7.21) 



Choosing now $r = 2\log M$ and $s = n/(16\pi aB^2\log M)$ yields together with (7.20) that



P 



max IfA > t 

1<J<A/ 



(7.22) 



Together with (7.9) and (7.14), this bound completes the proof of Theorem 4.6 by taking
$t = 4B\sqrt{\pi e a\,\frac{\log M}{n}\,\log(2/\delta)}$.
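7.4 Proof of Theorem 5.1

Note first that (5.16) implies (5.17). Indeed, by the Markov inequality, we have for any $\mu > 0$

$$\frac{1}{\mu}\,\mathbb{E}_{b' \circ \mathsf{f}_\lambda}\big[\mathcal{E}_{\mathrm{KL}}(T_n, \Lambda, b' \circ \mathsf{f}_\lambda, \mathcal{H})\big] \ge P_{b' \circ \mathsf{f}_\lambda}\big[\mathcal{E}_{\mathrm{KL}}(T_n, \Lambda, b' \circ \mathsf{f}_\lambda, \mathcal{H}) > \mu\big].$$

Besides, (5.16) follows if we prove

$$\inf_{T_n}\ \max_{g \in \mathcal{G}}\ P_{b' \circ g}\Big[\mathcal{E}_{\mathrm{KL}}(T_n, \Lambda, b' \circ g, \mathcal{H}) \ge c_*\frac{\kappa^2}{2a}\,\Delta^*_{n,M}(\Lambda)\Big] \ge \delta, \tag{7.23}$$

where $\mathcal{G}$ is a finite family of functions such that $\mathcal{G} \subset \{\mathsf{f}_\lambda : \lambda \in \Lambda\}$.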

The rest of the proof consists in two steps. We first reduce the lower bound (7.23) to a lower
bound on the squared prediction risk. In the second step, we use standard techniques to bound
the squared prediction risk from below.

1°. Fix an estimator $T_n$ that takes values in $\Gamma(\kappa^2)$ and recall that from (4.1), we have for
any $g \in \mathcal{G}$



IC{Pb'og\\Pb'oTj = —{{9,Tn) " (6 o T„, I)) - ^IE[c(y,)] + £^(^3) . 
" i=l 

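A second order Taylor expansion along the segment $\{\alpha g + (1 - \alpha)T_n : \alpha \in [0, 1]\} \subset \Gamma(\kappa^2)$ yields

$$\mathcal{K}(P_{b' \circ g} \| P_{b' \circ T_n}) \ge \frac{\kappa^2}{2a}\|g - T_n\|^2,$$

where we used the fact that $\alpha = 1$ minimizes the function $\alpha \mapsto \mathcal{K}(P_{b' \circ g} \| P_{b' \circ [\alpha g + (1 - \alpha)T_n]})$ over
$[0, 1]$ and that the value at the minimum is zero. Therefore, in view of (7.23), it is sufficient to
prove that

$$\inf_{T_n}\ \max_{g \in \mathcal{G}}\ P_{b' \circ g}\big\{\|g - T_n\|^2 \ge c_*\,\Delta^*_{n,M}(\Lambda)\big\} \ge \delta, \tag{7.24}$$

where the infimum is taken over all estimators.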

2°. The problem has now been reduced to proving a minimax lower bound for estimation
in squared prediction risk and can be solved using standard arguments from Tsybakov (2009,
Chapter 2), and in particular Theorem 2.5. This theorem requires upper bounds on the quantities
$\mathcal{K}(P^n_{b' \circ g} \| P^n_{b' \circ h})$, $g, h \in \mathcal{G}$, where $P^n_{b' \circ g}$ denotes the joint distribution of the observations
$(Y_1, \dots, Y_n)$ with $\mathbb{E}[Y_i] = b' \circ g(x_i)$. Since the observations are independent, it holds

$$\mathcal{K}(P^n_{b' \circ g} \| P^n_{b' \circ h}) = \sum_{i=1}^n \mathcal{K}(P^i_{b' \circ g} \| P^i_{b' \circ h}),$$

where $P^i_{b' \circ g}$ denotes the distribution of $Y_i$ with $\mathbb{E}[Y_i] = b' \circ g(x_i)$. Upper bounds on the Kullback-Leibler
divergence can be obtained using condition 1. Indeed, for any $g, h \in \mathcal{G}$, a second order
Taylor expansion yields

$$\mathcal{K}(P^i_{b' \circ g} \| P^i_{b' \circ h}) \le \frac{B^2}{2a}\big(g(x_i) - h(x_i)\big)^2. \tag{7.25}$$
We now review the conditions that we have already imposed on the family $\mathcal{G}$ to achieve the
reduction in 1°, together with those that are sufficient to apply Tsybakov's theorem.

(A) $\mathcal{H} \in \mathcal{D}$ with rank less than $D$

(B) $\mathcal{G} \subset \{\mathsf{f}_\lambda : \lambda \in \Lambda\}$

(C) $\|g - h\|^2 \ge 2c_*\,\Delta^*_{n,M}(\Lambda)$, $\forall\, g, h \in \mathcal{G}$, $g \ne h$

(D) $\mathcal{K}(P^n_{b' \circ g} \| P^n_{b' \circ h}) \le \log(\mathrm{card}(\mathcal{G}))/8$, $\forall\, g, h \in \mathcal{G}$

If the four conditions above are satisfied, then Theorem 2.5 of Tsybakov (2009) implies that
there exists $\delta > 0$ such that (7.24) and thus (7.23) holds.

The rest of the proof consists in carefully choosing the family Q and depends on the aggrega- 
tion problem at hand. Several of the subsequent constructions are based on the following class 
of matrices. For any 1 < D < M An, consider the random matrix X of size D x AI such that its 
elements = I, ■ ■ ■ , D, j = 1, . . . , M are i.i.d. Rademacher random variables, i.e., random 

variables taking values 1 and —1 with probability 1/2. 

Assume $S$ is a positive integer that satisfies

$$\frac{S}{D}\log\Big(1 + \frac{eM}{S}\Big) \le C_0 \tag{7.26}$$

for some positive constant $C_0 < 1/2$. Theorem 5.2 in Baraniuk et al. (2008) (see also subsection
5.2.1 in Rigollet and Tsybakov, 2010) entails that if (7.26) holds for $C_0$ small enough, then there
exists a nonempty set $\mathcal{M}(D)$ of matrices obtained as realizations of the matrix $X$ that enjoy the
following weak restricted isometry (WRI) property: for any $X \in \mathcal{M}(D)$, there exist constants
$\bar\kappa \ge \underline\kappa > 0$ such that for any $\lambda \in \mathbb{R}^M$ with at most $2S$ nonzero coordinates,

$$\underline\kappa\,|\lambda|_2^2 \le \frac{|X\lambda|_2^2}{D} \le \bar\kappa\,|\lambda|_2^2 \tag{7.27}$$

when $S$ satisfies (7.26).
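
The WRI property (7.27) can be explored empirically. The Python sketch below draws one realization of a Rademacher matrix and evaluates the ratio $|X\lambda|_2^2/(D|\lambda|_2^2)$ on random vectors with $2S$ nonzero coordinates. The dimensions, seeds and helper names are arbitrary assumptions. A simulation of this kind only illustrates how the ratio concentrates around 1; it does not certify that the drawn matrix belongs to $\mathcal{M}(D)$.

```python
import random

def rademacher_matrix(D, M, seed=0):
    """D x M matrix with i.i.d. entries equal to +1 or -1 with probability 1/2."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(M)] for _ in range(D)]

def isometry_ratio(X, lam):
    """Return |X lam|_2^2 / (D |lam|_2^2) for a nonzero vector lam."""
    D = len(X)
    sq_norm_image = sum(sum(x * l for x, l in zip(row, lam)) ** 2 for row in X)
    sq_norm_lam = sum(l * l for l in lam)
    return sq_norm_image / (D * sq_norm_lam)

def random_sparse_vector(M, sparsity, rng):
    """Vector of length M with `sparsity` Gaussian entries at random positions."""
    lam = [0.0] * M
    for j in rng.sample(range(M), sparsity):
        lam[j] = rng.gauss(0.0, 1.0)
    return lam

D, M, S = 200, 400, 2          # illustrative sizes with D <= M
X = rademacher_matrix(D, M)
rng = random.Random(1)
ratios = [isometry_ratio(X, random_sparse_vector(M, 2 * S, rng)) for _ in range(500)]

# Empirical stand-ins for the lower and upper isometry constants on this sample.
print(min(ratios), max(ratios))
```

For a vector with a single nonzero coordinate the ratio equals 1 exactly, since every column of $X$ has squared Euclidean norm $D$.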

Model selection aggregation. 

Recall that in this case $\Delta^*_{n,M}(\mathcal{V}) = (D \wedge \log M)/n$ and assume first that

$$D \ge \frac{2}{C_0}\log\Big(1 + \frac{eM}{2}\Big) . \tag{7.28}$$

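Take the dictionary $\mathcal{H} = \{f_1, \ldots, f_M\}$ to be such that for any $j = 1, \ldots, M$,

$$f_j(x_i) = \begin{cases} \tau X_{i,j}\sqrt{\dfrac{D \wedge \log M}{D}} & \text{if } i \le D, \\ 0 & \text{otherwise,} \end{cases}$$

where $X \in \mathcal{M}(D)$ and $\tau \in (0,1)$ is to be chosen later. Clearly, this dictionary has rank less
than $D$. We simply choose the family $\mathcal{G} = \{f_1, \ldots, f_M\}$ and check conditions (A)-(D).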
Conditions (A)-(B). Since $0 < \tau < 1$ and $D \le n$, we have

$$\|f_j\|^2 = \frac{\tau^2}{n}\sum_{i=1}^{D}\frac{D\wedge\log M}{D} = \tau^2\,\frac{D \wedge \log M}{n} \le \tau^2 \le 1, \qquad \|f_j\|_\infty = \tau\sqrt{\frac{D\wedge\log M}{D}} \le \tau \le 1,$$

so that (A) holds. Moreover, (B) clearly holds.

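Conditions (C)-(D). For any $f_j, f_k \in \mathcal{G}$, $j \neq k$, the WRI condition (7.27), which holds for $S = 2$
under (7.28), yields

$$2\underline\kappa\tau^2\,\frac{D\wedge\log M}{n} \le \|f_j - f_k\|^2 \le 2\bar\kappa\tau^2\,\frac{D\wedge\log M}{n} .$$

The left inequality implies that (C) holds with $c_* = \tau^2\underline\kappa$. Finally, (7.25) and the right inequality
in the above display yield

$$\mathcal{K}(P_{b'\circ f_j}\|P_{b'\circ f_k}) = \sum_{i=1}^n \mathcal{K}(P^{(i)}_{b'\circ f_j}\|P^{(i)}_{b'\circ f_k}) \le \frac{nB^2}{2a}\|f_j - f_k\|^2 \le \frac{B^2\bar\kappa\tau^2}{a}\log M .$$

As a result, (D) holds as long as $\tau \le \sqrt{a/(8B^2\bar\kappa)}$.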
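
For the reader's convenience, the arithmetic behind this choice of $\tau$ can be spelled out as follows. This is a sketch under two assumptions: that (7.25) holds with constant $B^2/(2a)$, and that (7.27) is applied to $\lambda = e_j - e_k$, so that $|\lambda|_2^2 = 2$. Recall also that $\mathrm{card}(\mathcal{G}) = M$ in this construction.

```latex
\begin{align*}
\mathcal{K}(P_{b'\circ f_j}\|P_{b'\circ f_k})
  &\le \frac{nB^2}{2a}\,\|f_j - f_k\|^2
   \le \frac{nB^2}{2a}\cdot 2\bar\kappa\tau^2\,\frac{D\wedge\log M}{n} \\
  &= \frac{B^2\bar\kappa\tau^2}{a}\,(D\wedge\log M)
   \le \frac{B^2\bar\kappa\tau^2}{a}\,\log M
   \le \frac{\log M}{8}
   \quad\text{whenever}\quad \tau^2 \le \frac{a}{8B^2\bar\kappa} .
\end{align*}
```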
Assume now that

$$D < \frac{2}{C_0}\log\Big(1 + \frac{eM}{2}\Big) . \tag{7.29}$$

Define $D' \ge 1$ to be the largest integer such that $D' \le D$ and $2^{D'} \le M$. It is not hard to show
that if $M \ge 4$, equation (7.29) yields

$$D' \ge \frac{C_0}{2\log(1+2e)}\,D = C_1 D .$$

Besides, if $M \le 3$, we have $D \le 3$ and $D' = 1$ so that $D' \ge D/3 \ge C_1 D$. Consider the
set of functions $\phi_1, \ldots, \phi_{D'}$ such that $\phi_j(x_i) = 1$ if and only if $i = j$, $i = 1, \ldots, n$. For any
$\omega = (\omega_1, \ldots, \omega_{D'}) \in \{0,1\}^{D'}$ define the function $\phi_\omega = \sum_{j=1}^{D'}\omega_j\phi_j$ and observe that for any
$\omega' \in \{0,1\}^{D'}$,

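$$\|\phi_\omega - \phi_{\omega'}\|^2 = \frac{1}{n}\rho(\omega, \omega'), \tag{7.30}$$

where $\rho(\omega, \omega') = \sum_{j=1}^{D'}(\omega_j - \omega'_j)^2$ denotes the Hamming distance between $\omega$ and $\omega'$. Rigollet and Tsybakov
(2010, Lemma 8.3) guarantees the existence of a subset $\{\omega^{(1)}, \ldots, \omega^{(d)}\} \subset \{0,1\}^{D'}$ such that
$\log d \ge C_2 D'$ and for any $1 \le j < k \le d$,

$$\rho(\omega^{(j)}, \omega^{(k)}) \ge \frac{D'}{4}, \tag{7.31}$$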

where $C_2$ is a numerical constant. From the definition of $D'$, we have $d \le M$ and we choose the
dictionary $\mathcal{H}$ to be composed of functions $f_j = \tau\phi_{\omega^{(j)}}$, $j = 1, \ldots, d$, where $0 < \tau < 1$ is to be
chosen later and $f_j = 0$, $j = d+1, \ldots, M$. Clearly, this dictionary has rank less than $D' \le D$.
We simply choose the family $\mathcal{G} = \{f_1, \ldots, f_d\}$ and check conditions (A)-(D).
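
The identity (7.30) and the existence of a well-separated subset of the hypercube can be illustrated on a small example. The Python sketch below checks that the empirical squared norm of $\phi_\omega - \phi_{\omega'}$ equals the Hamming distance divided by $n$, and then builds a separated subset greedily. The greedy search is an assumption made for illustration only: it is not the construction behind Lemma 8.3 and comes with no guarantee on $\log d$. All names and sizes are hypothetical.

```python
from itertools import product

def hamming(u, v):
    """Hamming distance between two binary vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def phi_omega(omega, n):
    """Values of phi_omega at x_1, ..., x_n, where phi_j(x_i) = 1 iff i = j."""
    return [omega[i] if i < len(omega) else 0 for i in range(n)]

def sq_norm(values):
    """Empirical squared norm ||f||^2 = (1/n) sum_i f(x_i)^2."""
    return sum(v * v for v in values) / len(values)

def greedy_packing(dim, min_dist):
    """Greedily keep binary vectors whose pairwise Hamming distance is >= min_dist."""
    chosen = []
    for cand in product((0, 1), repeat=dim):
        if all(hamming(cand, w) >= min_dist for w in chosen):
            chosen.append(cand)
    return chosen

n, D_prime = 20, 8                      # illustrative sizes with D' <= n
u = (1, 0, 1, 1, 0, 0, 0, 1)
v = (0, 0, 1, 0, 1, 0, 0, 1)
diff = [s - t for s, t in zip(phi_omega(u, n), phi_omega(v, n))]
print(sq_norm(diff), hamming(u, v) / n)  # both sides of (7.30)

packing = greedy_packing(D_prime, D_prime / 4)
print(len(packing))
```

Multiplying each retained $\phi_{\omega^{(j)}}$ by $\tau$ gives a family of the form used above, with pairwise squared distances at least $\tau^2 D'/(4n)$.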
Conditions (A)-(B). Since $0 < \tau < 1$ and $D' \le D \le n$, we have for any $j = 1, \ldots, d$,

$$\|f_j\|^2 = \frac{\tau^2}{n}\sum_{i=1}^{D'}\omega^{(j)}_i \le \tau^2 \le 1, \qquad \|f_j\|_\infty = \tau\max_{1\le i\le D'}\omega^{(j)}_i \le \tau \le 1,$$

so that (A) holds. Moreover, (B) also holds.

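Conditions (C)-(D). For any $f_j, f_k \in \mathcal{G}$, $j \neq k$, (7.30) and (7.31) yield

$$\|f_j - f_k\|^2 = \frac{\tau^2}{n}\rho(\omega^{(j)}, \omega^{(k)}) \ge \frac{\tau^2 D'}{4n} \ge \frac{\tau^2 C_1 D}{4n},$$

which implies that (C) holds with $c_* = \tau^2C_1/8$. Finally, (7.25) and (7.30) yield

$$\mathcal{K}(P_{b'\circ f_j}\|P_{b'\circ f_k}) \le \frac{nB^2}{2a}\|f_j - f_k\|^2 = \frac{B^2\tau^2}{2a}\rho(\omega^{(j)}, \omega^{(k)}) \le \frac{B^2\tau^2 D'}{2a} .$$

To complete the proof of (D), it is enough to observe that $D' \le (\log d)/C_2$ and to choose $\tau \le \sqrt{C_2a/(8B^2)}$.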

Linear aggregation. 

Recall that in this case $\Delta^*_{n,M}(\Lambda) = D/n$. Recall that for any $j = 1, \ldots, M$, $\phi_j(x_i) = 1$ if
and only if $i = j$, $i = 1, \ldots, n$. Take the dictionary $\mathcal{H} = \{f_1, \ldots, f_M\}$ where $f_j = \tau\phi_j$ for some
$\tau \in (0,1)$ to be chosen later if $1 \le j \le D$ and $f_j = 0$ if $j > D$. Clearly this dictionary has
rank less than $D$. Similarly to the case of model selection covered above, Rigollet and Tsybakov
(2010, Lemma 8.3) guarantees the existence of a subset $\{\omega^{(1)}, \ldots, \omega^{(d)}\} \subset \{0,1\}^D$ such that
$\log d \ge C_3 D$ and for any $1 \le j < k \le d$,

$$\rho(\omega^{(j)}, \omega^{(k)}) \ge \frac{D}{4} . \tag{7.32}$$

We choose $\mathcal{G} = \{g_1, \ldots, g_d\}$ where $g_j = \sum_{k=1}^{D}\omega^{(j)}_k f_k$. We now check conditions (A)-(D).
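Conditions (A)-(B). We have for any $j = 1, \ldots, d$,

$$\|f_j\|^2 \le \frac{\tau^2}{n} \le 1, \qquad \|f_j\|_\infty \le \tau \le 1,$$

so that (A) holds. Moreover, since for any $j$, $\max_k \omega^{(j)}_k \le 1$, condition (B) also holds for any
$\Lambda \supset \Lambda_\infty(1)$.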

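Conditions (C)-(D). For any $g_j, g_k \in \mathcal{G}$, $j \neq k$, (7.30) and (7.32) yield

$$\|g_j - g_k\|^2 = \frac{\tau^2}{n}\rho(\omega^{(j)}, \omega^{(k)}) \ge \frac{\tau^2 D}{4n},$$

which implies that (C) holds with $c_* = \tau^2/8$. Finally, (7.25) and (7.30) yield

$$\mathcal{K}(P_{b'\circ g_j}\|P_{b'\circ g_k}) \le \frac{nB^2}{2a}\|g_j - g_k\|^2 = \frac{B^2\tau^2}{2a}\rho(\omega^{(j)}, \omega^{(k)}) \le \frac{B^2\tau^2 D}{2a} .$$

To complete the proof of (D), it is enough to observe that $D \le (\log d)/C_3$ and to choose $\tau \le \sqrt{C_3a/(8B^2)}$.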

Convex aggregation. 

Recall that in this case

$$\Delta^*_{n,M}(\Lambda_1(1)) = \frac{D}{n} \wedge \sqrt{\frac{1}{n}\log\Big(1 + \frac{eM}{\sqrt{n}}\Big)} .$$

If (7.29) holds, then we choose $\mathcal{G}$ to be the same family as in the second part of the proof
for model selection. Indeed, conditions (A)-(D) have all been checked with $\Lambda = \mathcal{V} \subset \Lambda_1(1)$.
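
For concreteness, the three rates appearing in this proof can be evaluated numerically. The Python sketch below assumes that the rates read $(D \wedge \log M)/n$ for model selection, $D/n$ for linear aggregation and $(D/n) \wedge \sqrt{\log(1 + eM/\sqrt{n})/n}$ for convex aggregation. The function names are illustrative and the unused argument of `rate_linear` is kept only for a uniform signature.

```python
import math

def rate_model_selection(n, M, D):
    """(D ^ log M) / n, the model selection rate."""
    return min(D, math.log(M)) / n

def rate_linear(n, M, D):
    """D / n, the linear aggregation rate (M is unused)."""
    return D / n

def rate_convex(n, M, D):
    """(D / n) ^ sqrt(log(1 + e M / sqrt(n)) / n), the convex aggregation rate."""
    return min(D / n, math.sqrt(math.log(1.0 + math.e * M / math.sqrt(n)) / n))

# Example: a large dictionary relative to the sample size.
n, M, D = 10000, 10000, 5000
print(rate_model_selection(n, M, D), rate_linear(n, M, D), rate_convex(n, M, D))
```

In this example the square-root term is the smaller of the two quantities in the convex rate, which corresponds to the large $D$ regime treated below.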
Hereafter, we assume that

$$D \ge \frac{2}{C_0}\log\Big(1 + \frac{eM}{2}\Big) . \tag{7.33}$$

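We divide the rest of the proof into two cases: large $D$ and small $D$. For both cases, we take
the same dictionary $\mathcal{H} = \{f_1, \ldots, f_M\}$ to be such that for any $j = 1, \ldots, M$, $i = 1, \ldots, n$,
$k = 1, \ldots, D$,

$$f_j(x_i) = \begin{cases} X_{k,j} & \text{if } i = k \bmod D, \\ 0 & \text{otherwise,} \end{cases}$$

where $X \in \mathcal{M}(D)$. Note that the rank of $\mathcal{H}$ is at most $D$ and that $\mathcal{H} \in \mathcal{D}$ since

$$\|f_j\|^2 = \lfloor n/D \rfloor\,\frac{D}{n} \le 1, \qquad \|f_j\|_\infty = 1,$$

where $\lfloor n/D \rfloor$ denotes the integer part of $n/D$. Therefore, (A) holds.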
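
The periodic dictionary above can be written down explicitly on a toy example. The Python sketch below tiles the rows of a Rademacher matrix along the design points and checks the two norm identities. It assumes that the last $n - \lfloor n/D \rfloor D$ design points are mapped to $0$, which is one reading of the "otherwise" branch that is consistent with $\|f_j\|^2 = \lfloor n/D \rfloor D/n$. The helper names and sizes are hypothetical, and indices are 0-based in the code.

```python
import random

def rademacher_matrix(D, M, seed=0):
    """D x M matrix with i.i.d. entries equal to +1 or -1 with probability 1/2."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(M)] for _ in range(D)]

def periodic_dictionary(X, n):
    """Return F with F[j][i] = f_j(x_{i+1}): row (i mod D) of X, tiled along the design."""
    D, M = len(X), len(X[0])
    usable = (n // D) * D  # assumption: design points beyond floor(n/D) * D get value 0
    return [[X[i % D][j] if i < usable else 0 for i in range(n)] for j in range(M)]

def sq_norm(values):
    """Empirical squared norm ||f||^2 = (1/n) sum_i f(x_i)^2."""
    return sum(v * v for v in values) / len(values)

n, D, M = 23, 5, 7
X = rademacher_matrix(D, M)
F = periodic_dictionary(X, n)

# For f_lam = sum_j lam_j f_j, the tiling gives ||f_lam||^2 = floor(n/D) |X lam|_2^2 / n.
lam = [1, -1, 0, 2, 0, 0, 1]
f_lam = [sum(l * F[j][i] for j, l in enumerate(lam)) for i in range(n)]
X_lam = [sum(l * X[k][j] for j, l in enumerate(lam)) for k in range(D)]
lhs = sq_norm(f_lam)
rhs = (n // D) * sum(v * v for v in X_lam) / n
print(lhs, rhs)
```

The last identity is what allows the WRI property (7.27) of $X$ to be transferred to the functions $f_\lambda$ for sparse $\lambda$.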

For both cases, the choice of $\mathcal{G}$ relies on the following property. For any $\ell = 1, \ldots, M$, let $\Omega_\ell$
be the subset of $\{0,1\}^M$ defined by

$$\Omega_\ell = \Big\{\omega \in \{0,1\}^M : \sum_{j=1}^{M}\omega_j = \ell\Big\} . \tag{7.34}$$

Recall that according to Rigollet and Tsybakov (2010, Lemma 8.3), for any $\ell \le M/2$, there exists
a subset $\{\omega^{(1)}, \ldots, \omega^{(d)}\} \subset \Omega_\ell$ such that $\log d \ge C_4\ell\log(1 + eM/\ell)$ and for any $1 \le j < k \le d$,



Assume first that $D$ is large:

$$D \ge \frac{3}{2C_0}\sqrt{n\log\Big(1 + \frac{eM}{\sqrt{n}}\Big)} . \tag{7.35}$$


Since $D \le M$, it implies that $D \ge \nu\sqrt{n} \ge 2\sqrt{n}$ where $\nu \ge 2$ is the solution of

$$\nu = \frac{3}{2C_0}\sqrt{\log(1 + e\nu)} .$$

Let $m$ be the largest integer such that

$$m \le \sqrt{\frac{n}{\log(1 + eM/\sqrt{n})}}, \tag{7.36}$$

and observe that $m \ge 1$ if $n \ge \log(1 + eM/\sqrt{n})$. But $n \ge D$ together with (7.33) imply that



$$n \ge D \ge \frac{2}{C_0}\log\Big(1 + \frac{eM}{2}\Big) \ge \frac{1}{C_0}\log\Big(1 + \frac{eM}{\sqrt{n}}\Big) .$$






We conclude that $m \ge 1$ by observing that $C_0 < 1/2$. Furthermore, it clearly holds that
$m \le \sqrt{n}$, which in turn implies that $m \le M/2$ since $M \ge D \ge 2\sqrt{n}$.

According to Rigollet and Tsybakov (2010, Lemma 8.3), there exists a subset $\{\omega^{(1)}, \ldots, \omega^{(d)}\} \subset \Omega_m$
such that $\log d \ge C_4 m\log(1 + eM/m)$ and for any $1 \le j < k \le d$,



m 



We choose the family $\mathcal{G} = \{g_1, \ldots, g_d\}$, $g_j = \tau f_{\omega^{(j)}}/m$, $j = 1, \ldots, d$, where $\tau \in (0,1)$ is to be chosen
later and check conditions (B)-(D).

Condition (B). Note that $\mathcal{G} \subset \{f_\lambda : \lambda \in \Lambda_1(1)\}$ since $\frac{\tau}{m}\sum_k \omega^{(j)}_k = \tau \le 1$ for any $j = 1, \ldots, d$
and condition (B) holds.

Conditions (C)-(D). Note that from (7.36) and the monotonicity of the function $x \mapsto x\log(1 + eM/x)$, we have



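$$m\log\Big(1 + \frac{eM}{m}\Big) \le \sqrt{\frac{n}{\log(1 + eM/\sqrt{n})}}\,\log\Big(1 + \frac{eM}{\sqrt{n}}\sqrt{\log\Big(1 + \frac{eM}{\sqrt{n}}\Big)}\Big) \le \frac{3}{2}\sqrt{n\log\Big(1 + \frac{eM}{\sqrt{n}}\Big)} \le C_0 D,$$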



where we used respectively the fact that $\log(1 + ab) \le \log(1 + a) + \log(b)$, $a > 0$, $b \ge 1$, and (7.35).
The previous display implies that (7.26) holds with $S = m$ and we can apply (7.27) to obtain
that for any $g_j, g_k \in \mathcal{G}$, $j \neq k$,



eM^ 



where in the last inequality, we used (7.36). We have proved that (C) holds with $c_* = \tau^2\underline\kappa/16$.
To prove (D), note that (7.25) and (7.27) yield

Since $m \ge 1$, the definition of $m$ and the fact that $m \le \sqrt{n}$ yield

$$\frac{n}{m} \le 4m\log\Big(1 + \frac{eM}{m}\Big) \le \frac{4}{C_4}\log d .$$

Choosing $\tau^2 \le aC_4/(32B^2\bar\kappa)$ completes the proof of (D).
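
Two elementary steps used in the large $D$ case can be verified directly. The derivation below is a sketch that assumes (7.36) defines $m$ as the largest integer with $m \le \sqrt{n/L}$, where the shorthand $L = \log(1 + eM/\sqrt{n})$ is introduced here for readability.

```latex
% (i) For a > 0 and b >= 1:
\[
1 + ab \le b + ab = b(1 + a)
\quad\Longrightarrow\quad
\log(1 + ab) \le \log(1 + a) + \log b .
\]
% (ii) With L = \log(1 + eM/\sqrt{n}) and m the largest integer such that m \le \sqrt{n/L}:
\[
m \ge 1
\ \Longrightarrow\ 2m \ge m + 1 > \sqrt{n/L}
\ \Longrightarrow\ \frac{n}{m} < 4mL
= 4m\log\Big(1 + \frac{eM}{\sqrt{n}}\Big)
\le 4m\log\Big(1 + \frac{eM}{m}\Big),
\]
% where the last inequality uses m \le \sqrt{n}.
```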

We now turn to the case where $D$ is small. More precisely, assume that

$$D < \frac{3}{2C_0}\sqrt{n\log\Big(1 + \frac{eM}{\sqrt{n}}\Big)} . \tag{7.37}$$

Let $\ell$ be the largest integer such that

$$\frac{\ell}{D}\log\Big(1 + \frac{eM}{\ell}\Big) \le C_0, \tag{7.38}$$

and let $q > 0$ be such that

$$q^2 = \frac{\tau^2}{n}\log\Big(1 + \frac{eM}{\ell}\Big),$$

where $\tau \in (0, 1/2)$ is to be chosen later. It is clear from (7.33) that $\ell \ge 2$. Furthermore, $\ell \le M/2$
since $D \le M$ and $C_0 < 1/2$ imply that $\ell = M/2$ violates (7.38). Let $\{\omega^{(1)}, \ldots, \omega^{(d)}\} \subset \Omega_\ell$ be the
subset obtained from Rigollet and Tsybakov (2010, Lemma 8.3) such that $\log d \ge C_4\ell\log(1 + eM/\ell)$
and for any $1 \le j < k \le d$,

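We choose the family $\mathcal{G} = \{g_1, \ldots, g_d\}$, $g_j = q f_{\omega^{(j)}}$, $j = 1, \ldots, d$, and check conditions (B)-(D).

Condition (B). Note that

$$q^2\ell^2 = \frac{\tau^2\ell^2}{n}\log\Big(1 + \frac{eM}{\ell}\Big) \le \frac{\tau^2C_0\ell D}{n}, \tag{7.40}$$

where we used (7.38).

If $M \le 4\sqrt{n}$, using the fact that $\ell \le M/2$, we get

$$q^2\ell^2 \le \frac{\tau^2C_0\ell D}{n} \le \frac{\tau^2C_0MD}{2n} \le \frac{\tau^2C_0M^2}{2n} \le 8\tau^2C_0 \le 1,$$

since $\tau < 1/2$.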

If $M > 4\sqrt{n}$, note first that (7.37) yields

$$\frac{9n}{4C_0D} \ge \frac{3}{2}\sqrt{\frac{n}{\log(1 + eM/\sqrt{n})}} .$$



Thus, using the monotonicity of the function $x \mapsto x\log(1 + eM/x)$, we get



log (i + > A / - log 1 + jlog (i + 



where we used the assumption that $M > 4\sqrt{n}$ in the last but one inequality. As a result,
$9n/(4C_0D)$ violates (7.38) and $\ell < 9n/(4C_0D)$. It yields

$$q^2\ell^2 \le \frac{\tau^2C_0\ell D}{n} \le \frac{9\tau^2}{4} \le 1,$$

since $\tau < 1/2 < 2/3$. Thus, in both cases $q^2\ell^2 \le 1$, which implies that $\mathcal{G} \subset \{f_\lambda : \lambda \in \Lambda_1(1)\}$ so
that condition (B) holds.
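
The quantities $\ell$ and $q$ of the small $D$ case are easy to compute for given problem sizes. The Python sketch below assumes that (7.38) reads $(\ell/D)\log(1 + eM/\ell) \le C_0$ and that $q^2 = (\tau^2/n)\log(1 + eM/\ell)$. The numerical values of $n$, $M$, $D$, $C_0$ and $\tau$ are arbitrary illustrations, and the function name is hypothetical.

```python
import math

def largest_ell(M, D, C0):
    """Largest integer ell in [1, M] with (ell / D) * log(1 + e M / ell) <= C0, or None."""
    best = None
    for ell in range(1, M + 1):
        if (ell / D) * math.log(1.0 + math.e * M / ell) <= C0:
            best = ell
    return best

n, M, D = 400, 500, 100      # illustrative sizes with D <= min(M, n)
C0, tau = 0.45, 0.5          # illustrative constants with C0 < 1/2 and tau <= 1/2

ell = largest_ell(M, D, C0)
q2 = tau ** 2 / n * math.log(1.0 + math.e * M / ell)

# Condition (B) requires q * ell <= 1, i.e. q^2 * ell^2 <= 1.
print(ell, q2 * ell ** 2)
```

Because $x \mapsto x\log(1 + eM/x)$ is increasing, the admissible values of $\ell$ form an initial segment of the integers, so the exhaustive scan could be replaced by a bisection for large $M$.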

Conditions (C)-(D). Note that (7.26) holds with $S = \ell$ by (7.38) and we can apply (7.27) to
obtain that for any $g_j, g_k \in \mathcal{G}$, $j \neq k$,



II l|2 2ur r ||2 ^ ^ bV i^J ^ A ^ eM 
llS'j-fi'fcll = <? > ~ ^ log|l + — 

From the definition of $\ell$, we have



The last two displays complete the proof of (C) with $c_* = \tau^2\underline\kappa C_0/32$. To prove (D), note
that (7.25) and (7.27) yield

$$\mathcal{K}(P_{b'\circ g_j}\|P_{b'\circ g_k}) \le \frac{nB^2}{2a}\|g_j - g_k\|^2 \le \frac{B^2\bar\kappa\tau^2\ell}{a}\log\Big(1 + \frac{eM}{\ell}\Big) \le \frac{B^2\bar\kappa\tau^2}{aC_4}\log d .$$

Choosing $\tau^2 \le aC_4/(8B^2\bar\kappa)$ completes the proof of (D).